Rosetta@home

Minirosetta 3.14

David E K
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 961
ID: 14
Credit: 2,369,109
RAC: 1,381
Message 70554 - Posted 15 Jun 2011 23:51:42 UTC

This update includes a number of new and updated protocols. Please report bugs and issues here.

Details about the new protocols will be posted in separate threads as we start submitting jobs.

Kenny Frew

Joined: May 16 08
Posts: 2
ID: 259187
Credit: 98,306
RAC: 0
Message 70556 - Posted 16 Jun 2011 1:32:47 UTC - in response to Message ID 70554.
Last modified: 16 Jun 2011 1:33:35 UTC

Completed w.u. will not upload.
____________

cnick6

Joined: May 30 06
Posts: 24
ID: 85398
Credit: 5,296,712
RAC: 6,626
Message 70557 - Posted 16 Jun 2011 1:36:01 UTC

Yup, me too. Servers are down.
____________

Kenny Frew

Joined: May 16 08
Posts: 2
ID: 259187
Credit: 98,306
RAC: 0
Message 70558 - Posted 16 Jun 2011 1:47:48 UTC

Ok - It uploaded manually and it reported. Next is downloading slowly. Thanks
____________

David E K
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 961
ID: 14
Credit: 2,369,109
RAC: 1,381
Message 70559 - Posted 16 Jun 2011 1:59:54 UTC

The servers are going to be stressed for a bit as people try to download the new application. It should settle with time.

Chilean

Joined: Oct 16 05
Posts: 651
ID: 5008
Credit: 10,394,263
RAC: 2,476
Message 70561 - Posted 16 Jun 2011 5:35:33 UTC

Nice to see some progress being done :)

Sorta off-topic, are you guys planning to upgrade y'lls (my southern has stuck to me for life) BOINC version?
____________

Samson

Joined: May 23 11
Posts: 8
ID: 420170
Credit: 257,870
RAC: 0
Message 70562 - Posted 16 Jun 2011 7:44:51 UTC

Inquiring minds, me, would like to know what advantages Pi brings us
over 2.17 ?

Is there a list somewhere ?

Also, who's the genius that dubbed this 3.14 ?

:)

David E K
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 961
ID: 14
Credit: 2,369,109
RAC: 1,381
Message 70563 - Posted 16 Jun 2011 15:17:04 UTC

Chilean, we are not planning to upgrade the BOINC version anytime soon, but we will be upgrading the hardware in a few weeks or sooner.

Samson, it is version 3.14 due to many iterations of testing on our testing project, Ralph@home. This update includes a number of new protocols and also more methods (we call them movers) are available in our scripting protocol. We'll explain them in separate threads as they start getting used.

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 70564 - Posted 16 Jun 2011 23:02:16 UTC

Hi.

I just rejoined and I have no graphics for the new app; with a task running, the button is greyed out.

Other projects that have graphics are showing O.K., and when I ran here in the past the graphics worked fine on all rigs.

Rig is Ubuntu 10.04 LTS x64 and BOINC is 6.10.58 x64.

____________


David E K
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 961
ID: 14
Credit: 2,369,109
RAC: 1,381
Message 70565 - Posted 16 Jun 2011 23:12:53 UTC

P.P.L.,

Unfortunately we omitted the graphics app for the linux platform on this update because of time constraints. It has been so long since our last graphics app update, that our build machine set up for building the graphics no longer worked with the current version of minirosetta. We'll try to bring it back when we have more time to look into it.

Michael Gould

Joined: Feb 3 10
Posts: 39
ID: 368947
Credit: 1,149,075
RAC: 0
Message 70579 - Posted 18 Jun 2011 23:40:45 UTC

Transition was seamless here, looking forward to hearing about new capabilities. Nice job, all.

TPCBF

Joined: Nov 29 10
Posts: 108
ID: 403518
Credit: 1,858,486
RAC: 2,912
Message 70580 - Posted 18 Jun 2011 23:55:00 UTC

Since the 3.14 version, I get far more than usual randomly "hanging" WUs and validation errors with no granted credit... :-(

Ralf

David E K
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 961
ID: 14
Credit: 2,369,109
RAC: 1,381
Message 70581 - Posted 19 Jun 2011 4:04:08 UTC

Can you post the names of the workunits that are randomly hanging? The validation errors should eventually get credit.

TPCBF

Joined: Nov 29 10
Posts: 108
ID: 403518
Credit: 1,858,486
RAC: 2,912
Message 70583 - Posted 19 Jun 2011 7:18:30 UTC - in response to Message ID 70581.

Have to go through the logs of different machines tomorrow morning, seems to be more than one batch.
The worst is that the WU completely locks the BOINC manager out, the jobs don't "release" and switch to another project after the default of 2h as they usually do, blocking the whole system on single core CPUs when I don't notice it... :-(

Here's the worst one of today, hung at about 14%, killed it after +17h runtime, as "normal" jobs run 3-7h on that machine...

casd_sgr145_boinc_3busA_23.nonlocal.pctid_0.11.tmscore_0.70096._nonlocal_tex_IGNORE_THE_REST_27533_890_0

Ralf

TPCBF

Joined: Nov 29 10
Posts: 108
ID: 403518
Credit: 1,858,486
RAC: 2,912
Message 70586 - Posted 19 Jun 2011 16:27:09 UTC - in response to Message ID 70583.

And this one I just killed this morning, stuck at 46% after 14h39m over night...
ilv_hr41_all_boinc_3h8kA_73.nonlocal.pctid_0.18.tmscore_0.49037._nonlocal_tex_IGNORE_THE_REST_27535_3084
(BOINC the only running apps for days now, on a 2GB RAM machine, no indication of RAM issues)

And this one looks hung too, 15.44% after 5h41m
casd_sr10_boinc_3e0mC_3.nonlocal.pctid_0.52.tmscore_0.68362._nonlocal_tex_IGNORE_THE_REST_27537_2047
(also XPSP3, with 3GB RAM and a couple of apps open (email, web browser) over night, as usual)

All those jobs showed a "time to completion" of about 3h when downloaded, which usually is within 10% high/low of the actual runtime...

Ralf

TPCBF

Joined: Nov 29 10
Posts: 108
ID: 403518
Credit: 1,858,486
RAC: 2,912
Message 70587 - Posted 19 Jun 2011 17:41:35 UTC - in response to Message ID 70586.

And this one looks hung too, 15.44% after 5h41m
casd_sr10_boinc_3e0mC_3.nonlocal.pctid_0.52.tmscore_0.68362._nonlocal_tex_IGNORE_THE_REST_27537_2047
An hour later, the WU hasn't progressed 1/1000 of a %, only "time to completion" increased now to 14h, far away from the 3:05h runtime estimate when downloaded.
The process minirosetta_3.14_windows_intelx86.exe sits in the task manager using 295108K of RAM and 0% CPU, with +1GB of physical RAM available and a WCG WU running on the second core, with an average CPU usage about 40% while editing this on a Firefox browser tab...

I haven't looked through all the task that hung in the last 3 days or so, but what I referred to so far are 3 within a few hours (over night), while it used to be maybe one within a week/10 days before...

Ralf
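For anyone who wants to spot these hangs without watching Task Manager, here is a minimal sketch in Python of the symptom described above: wall-clock time keeps passing while CPU time stays flat. The helper name `is_stalled`, the thresholds, and the sample numbers are all made up for illustration; the samples would have to be read by hand (or by some other tool) from the task properties.

```python
from typing import List, Tuple

# Hypothetical helper. Each sample is (wall_clock_seconds, cpu_seconds)
# for a single task, e.g. noted down by hand from the task properties.
def is_stalled(samples: List[Tuple[float, float]],
               min_wall_seconds: float = 3600.0,
               max_cpu_gain: float = 1.0) -> bool:
    """True if CPU time grew by at most max_cpu_gain seconds while at
    least min_wall_seconds of wall-clock time passed."""
    if len(samples) < 2:
        return False  # not enough data to judge
    first_wall, first_cpu = samples[0]
    last_wall, last_cpu = samples[-1]
    wall_elapsed = last_wall - first_wall
    cpu_gained = last_cpu - first_cpu
    return wall_elapsed >= min_wall_seconds and cpu_gained <= max_cpu_gain

# Illustrative numbers only, not measurements from this thread.
healthy = [(0.0, 0.0), (3600.0, 3400.0)]
hung = [(0.0, 20460.0), (3600.0, 20460.0)]

print(is_stalled(healthy))
print(is_stalled(hung))
```

An hour with no CPU time gained is a deliberately cautious threshold; a task that is merely being preempted by another project would also look "stalled" to this check, so it is a hint rather than proof.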

David E K
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 961
ID: 14
Credit: 2,369,109
RAC: 1,381
Message 70591 - Posted 20 Jun 2011 0:23:30 UTC

We'll look into this further of course. In the mean time, I'd recommend suspending the project and aborting the work units that are stuck. Since it seems consistent on your machine, it would help us while debugging to join our Ralph@home project. We'll likely post an update on Ralph soon. Sorry for all the trouble.

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70594 - Posted 20 Jun 2011 1:23:05 UTC

Have you thought of creating a test application specifically to gather more information on the computer environment it is running on, then sending one such workunit to each machine known to have a problem with workunits freezing? No objection if it then goes on to attempt to run a normal workunit afterwards, possibly with more debugging output than usual enabled.

You'd probably also want to send such workunits to a variety of other computers, to gather outputs for comparison to those from the problem computers.

For example, if it is able to capture the line from the BOINC manager log file describing the CPU capabilities, and perhaps those describing GPU capabilities, just those lines should offer a good starting point in deciding what to look for, if it happens to be something related to matching up properly to the CPU type.

TPCBF

Joined: Nov 29 10
Posts: 108
ID: 403518
Credit: 1,858,486
RAC: 2,912
Message 70596 - Posted 20 Jun 2011 1:46:11 UTC - in response to Message ID 70591.

Had two more, "freezing" at 1.6% after 1:36h/1:40h, aborted both. Suspended Rosetta@Home on this machine, then joined Ralph@Home on this machine, but so far got only

>6/19/2011 6:37:53 PM ralph@home Message from server: No work sent

What's next?

Ralf

TPCBF

Joined: Nov 29 10
Posts: 108
ID: 403518
Credit: 1,858,486
RAC: 2,912
Message 70597 - Posted 20 Jun 2011 1:55:43 UTC - in response to Message ID 70594.

I don't see anything about the specs of the machines that would give a direct indication. It happened the most recently on 3 different ones:
- P4 2.8GHz (single core), 2GB RAM, just sitting idle most of the time as it is my Windows 2008 test server
- Vaio notebook, also sitting idle most of the time as the built-in keyboard is reluctant to work and I need to use an external one, Pentium M760 2GHz, 2GB RAM
- my main work computer, Core 2 Duo 6300@1.866GHz, 4GB RAM (3GB avail under XPSP3)

The last one has the most "freezes", but always plenty of CPU and RAM to spare (average 1GB physical RAM free).

Ralf

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70598 - Posted 20 Jun 2011 2:41:54 UTC

I just had one of the hung workunits, on a computer that's usually much more reliable at completing Rosetta@home workunits.

casd_sgr145_boinc_3duwA_208.nonlocal.pctid_0.09.tmscore_0.63331._nonlocal_tex_IGNORE THE REST_27533_3268

I've selected 12 hour workunit lengths.

Currently at 20:04:42 elapsed, 5.770% progress and not increasing, 53:36:27 to completion.

CPU time at last checkpoint 00:43:47
CPU time 00:43:51

Commit size 352,408 KB

BOINC lists it as running, but it's not using any CPU time.

The workunits on the other CPU cores are from other BOINC projects, and running just fine.

Appears to be one of the several workunits I've seen where the hang occurred just after a checkpoint.

BOINC appears likely to be in a state where it asks for GPU workunits only, and only from BOINC projects where I've never seen any available. I've seen this condition fairly often on my laptop before, but not on my desktop where it is now.

Clicking on the Show Graphics button brings up the minirosetta_graphics_3.13_windows_x86_64.exe program (previously not running) showing a window with a proper frame and a proper label at the top, but with the space inside the frame totally black. Clicking on the X in the red space at the top right corner of the graphics window gives this error message:

minirosetta_graphics_3.13_windows_x86_64.exe is not responding

Problem Event Name: AppHangB1
Hang Signature: dd05
Hang Type: 0
(There are many more detail lines, but I can't copy them: every time I open a window to paste into, or start Snipping Tool, the error message disappears.)

Should I abort this workunit so you can see the output files?

Or just restart BOINC to see if that will restart that workunit properly?

Or something else?


6/18/2011 5:48:13 PM Starting BOINC client version 6.10.58 for windows_x86_64
6/18/2011 5:48:13 PM log flags: file_xfer, sched_ops, task
6/18/2011 5:48:13 PM Libraries: libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.3
6/18/2011 5:48:13 PM Data directory: C:\ProgramData\BOINC
6/18/2011 5:48:13 PM Running under account Bobby
6/18/2011 5:48:13 PM Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]
6/18/2011 5:48:13 PM Processor: 6.00 MB cache
6/18/2011 5:48:13 PM Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe
6/18/2011 5:48:13 PM OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
6/18/2011 5:48:13 PM Memory: 8.00 GB physical, 15.66 GB virtual
6/18/2011 5:48:13 PM Disk: 919.67 GB total, 544.50 GB free
6/18/2011 5:48:13 PM Local time is UTC -5 hours
6/18/2011 5:48:13 PM NVIDIA GPU 0: GeForce GTS 450 (driver version 26724, CUDA version 3020, compute capability 2.1, 993MB, 476 GFLOPS peak)

6/18/2011 5:48:13 PM General prefs: using separate prefs for work
6/18/2011 5:48:13 PM Reading preferences override file
6/18/2011 5:48:13 PM Preferences:
6/18/2011 5:48:13 PM max memory usage when active: 3276.16MB
6/18/2011 5:48:13 PM max memory usage when idle: 3276.16MB
6/18/2011 5:48:13 PM max disk usage: 30.00GB
6/18/2011 5:48:13 PM max CPUs used: 3
6/18/2011 5:48:13 PM (to change preferences, visit the web site of an attached project, or select Preferences in the Manager)
6/18/2011 5:48:13 PM Not using a proxy
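The earlier suggestion of comparing CPU capabilities between problem and non-problem machines can be tried by hand on startup logs like the one above. Below is a minimal Python sketch that pulls the processor model and feature flags out of such lines. The function name `summarize_cpu` and the dictionary keys are invented for this example, and the regular expressions assume the log lines look exactly like the ones pasted in this post; other BOINC versions may format them differently.

```python
import re
from typing import Dict, List

# Patterns assume the line layout seen in the log pasted above.
PROCESSOR_RE = re.compile(r"Processor:\s*(\d+)\s+(\S+)\s+(.+?)\s*\[(.+)\]\s*$")
FEATURES_RE = re.compile(r"Processor features:\s*(.+)$")

def summarize_cpu(log_lines: List[str]) -> Dict[str, object]:
    """Collect core count, vendor, model and feature flags from startup lines."""
    summary: Dict[str, object] = {}
    for line in log_lines:
        line = line.strip()
        m = PROCESSOR_RE.search(line)
        if m:
            summary["count"] = int(m.group(1))
            summary["vendor"] = m.group(2)
            summary["model"] = m.group(3)
            summary["family_model_stepping"] = m.group(4)
            continue
        m = FEATURES_RE.search(line)
        if m:
            summary["features"] = m.group(1).split()
    return summary

startup_log = [
    "6/18/2011 5:48:13 PM Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]",
    "6/18/2011 5:48:13 PM Processor: 6.00 MB cache",
    "6/18/2011 5:48:13 PM Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe",
]

cpu = summarize_cpu(startup_log)
print(cpu["vendor"], cpu["model"])
print(len(cpu["features"]), "feature flags")
```

Collecting the `features` lists from several machines and comparing them as sets would show whether the affected hosts share a flag that the unaffected ones lack, assuming the problem is CPU-related at all.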

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70599 - Posted 20 Jun 2011 2:48:29 UTC - in response to Message ID 70597.
Last modified: 20 Jun 2011 3:12:11 UTC

Combining your system types and mine suggests that it might be worthwhile checking if it is specific to Intel CPUs, and even perhaps some ranges of Intel CPU types.

As for what's next on Ralph@Home, it currently has no workunits queued, so you'll have to wait for some to become available.


Since I currently have an Einstein@Home CPU workunit that seems rather reluctant to finish in a reasonable time (perhaps due to the debt from the several Einstein@Home GPU workunits run recently), I've decided to drain the queue of CPU workunits on my desktop by temporarily setting all BOINC projects offering CPU workunits to No New Tasks.

David E K
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 961
ID: 14
Credit: 2,369,109
RAC: 1,381
Message 70600 - Posted 20 Jun 2011 3:17:46 UTC

If you have stalled work units, please go ahead and abort the jobs and manually kill the minirosetta process from the task manager if necessary. We'll post an update on Ralph sometime early next week and submit more test work units. Please post the names of the work units so we know which ones to test on Ralph. I'd also recommend suspending R@h for the time being.

alpha

Joined: Nov 4 06
Posts: 27
ID: 127202
Credit: 953,255
RAC: 781
Message 70603 - Posted 20 Jun 2011 13:54:35 UTC

Computation error: http://boinc.bakerlab.org/rosetta/result.php?resultid=430150155

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x750F9617
____________

Holmis

Joined: Nov 15 07
Posts: 6
ID: 220968
Credit: 975,490
RAC: 0
Message 70604 - Posted 20 Jun 2011 15:36:54 UTC

I've also got a computation error on this task.

<message>
Felaktig funktion. (0x1) - exit code 1 (0x1)
</message>

and

ERROR: Cannot open PDB file "2ilaA.pdb"
ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Translation:
Felaktig funktion = Incorrect function

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70605 - Posted 20 Jun 2011 16:54:58 UTC

One more where BOINC thinks it's running, but it's using no CPU time at all:

ilv_hr41_all_boinc_2ebmA_108.nonlocal.pctid_0.14.tmscore_0.45048._nonlocal_tex_IGNORE_THE_REST_27535_3351

12 hour workunits requested.

13:38:38 elapsed, 62.456% progress, 07:46:40 To completion

CPU time at last checkpoint 07:31:42
CPU time 07:31:46

Appears to be one more of the many I've seen that stopped using any CPU time shortly after a checkpoint, or possibly after the checkpoint was started but not yet finished.

Rosetta@Home already on No new tasks while I drain the list of CPU workunits to force one especially slowly running workunit to get enough CPU time to finish. No more already on that computer.

Computer environment already described above.

Alan J Rodger

Joined: Oct 16 05
Posts: 7
ID: 4998
Credit: 32,282
RAC: 0
Message 70607 - Posted 20 Jun 2011 17:29:49 UTC

I've had to abort several work units because time elapsed and time to completion both go up without % completion changing - one work unit reached 25 hours and went from ca 3 hours to completion at the start to 18 hours to completion. Many are minirosetta 3.14. What's up?

Alan
____________

David E K
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 961
ID: 14
Credit: 2,369,109
RAC: 1,381
Message 70608 - Posted 20 Jun 2011 17:53:41 UTC

There are reports of minirosetta 3.14 continuing on with 0 cpu usage. This sounds like the same issue. You'll have to manually kill the process using the task manager and suspend the R@h project for the time being. We are looking into this issue.

.clair.

Joined: Jan 2 07
Posts: 45
ID: 139198
Credit: 6,166,589
RAC: 4,637
Message 70610 - Posted 20 Jun 2011 23:50:45 UTC

Compute error after 2.1 seconds, wingman had the same

ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_2350_1

ERROR: Cannot open PDB file "2ilaA.pdb"
ERROR:: Exit from: src/core/import_pose/import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


workunit - http://boinc.bakerlab.org/rosetta/workunit.php?wuid=392925238
____________

N7QLT

Joined: Dec 19 05
Posts: 2
ID: 40619
Credit: 1,821,427
RAC: 48
Message 70613 - Posted 21 Jun 2011 15:28:48 UTC

I have noticed occasions when workunits are running but not using any CPU. If I exit Boinc and request it to stop science projects, wait a moment and then restart Boinc, it seems to reset "something" and the Rosetta projects start running again. I am guessing that it is related to some other process on the system, maybe nightly virus scans or other maintenance. Maybe it's related to the workunit in progress. Just haven't had time to research it further.
____________

Holmis

Joined: Nov 15 07
Posts: 6
ID: 220968
Credit: 975,490
RAC: 0
Message 70614 - Posted 21 Jun 2011 19:06:12 UTC

Got one more compute error today, it's the same error as I posted in message #70604 earlier in this thread.

ERROR: Cannot open PDB file "2ilaA.pdb"
ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

It also appears to be the same type of task:
ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_4527_0

and

ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_6682_0

Link to new error: http://boinc.bakerlab.org/rosetta/result.php?resultid=431062430


Something wrong with them?
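Since several people are posting workunit names, here is a rough Python sketch for grouping reported names so that a bad series stands out. The regular expression is only a guess at the naming scheme, based on the names posted in this thread; the group names `family`, `target`, `batch`, `index` and `copy` are labels invented here, not the project's own terminology.

```python
import re
from collections import Counter
from typing import Optional, Tuple

# Guessed layout: <family>_boinc_<target>_..._IGNORE_THE_REST_<n>_<n>[_<n>]
NAME_RE = re.compile(
    r"^(?P<family>.+?)_boinc_(?P<target>[^_]+)_.*"
    r"IGNORE_THE_REST_(?P<batch>\d+)_(?P<index>\d+)(?:_(?P<copy>\d+))?$"
)

def family_and_target(name: str) -> Optional[Tuple[str, str]]:
    """Return (family, target) for a workunit name, or None if it does not match."""
    m = NAME_RE.match(name.strip())
    return (m.group("family"), m.group("target")) if m else None

# Names copied from posts in this thread.
reported = [
    "ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_2350_1",
    "ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_4527_0",
    "ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_6682_0",
    "casd_sgr145_boinc_3busA_23.nonlocal.pctid_0.11.tmscore_0.70096._nonlocal_tex_IGNORE_THE_REST_27533_890_0",
]

counts = Counter(key for key in map(family_and_target, reported) if key)
for (family, target), n in counts.most_common():
    print(family, target, n)
```

With the names above, the three fast failures all land in the same `ilv_fgf2_all` / `2ilaA` group, which fits the "Cannot open PDB file "2ilaA.pdb"" message.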

.clair.

Joined: Jan 2 07
Posts: 45
ID: 139198
Credit: 6,166,589
RAC: 4,637
Message 70616 - Posted 21 Jun 2011 21:26:54 UTC

It looks like the ilv_fgf2_all_boinc units have a problem: another one failed after 2.1 seconds with the same error as before.

ERROR: Cannot open PDB file "2ilaA.pdb"
ERROR:: Exit from: src/core/import_pose/import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=393122110
____________

James Thompson

Joined: Oct 13 05
Posts: 46
ID: 4392
Credit: 186,109
RAC: 0
Message 70623 - Posted 22 Jun 2011 16:36:30 UTC - in response to Message ID 70616.

Something is wrong with those workunits. I'll remove them now.


____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 70625 - Posted 22 Jun 2011 23:22:45 UTC
Last modified: 22 Jun 2011 23:24:43 UTC

This is not funny. Why is this still happening?

This task should have finished after 49 models. I believe it should have stopped, but for some reason it started again and did one more model, and that's all I'm getting credit for after 4+ hrs.

# cpu_run_time_pref: 14400
======================================================
DONE :: 49 starting structures 14250.2 cpu seconds
This process generated 49 decoys from 49 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 14501.5 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

Valid
Claimed credit__119.52
Granted credit__1.81

application version 3.14
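To compare output like this across tasks, the "DONE ::" summaries can be extracted mechanically. A minimal Python sketch follows; the function name `summarize_runs` is made up, and the patterns assume the output looks exactly like the excerpt above. It only pulls out the numbers and makes no attempt to explain why the task started a second run.

```python
import re
from typing import List, Tuple

PREF_RE = re.compile(r"cpu_run_time_pref:\s*(\d+)")
DONE_RE = re.compile(r"DONE ::\s+(\d+) starting structures\s+([\d.]+) cpu seconds")

def summarize_runs(output_text: str) -> List[Tuple[int, int, float]]:
    """Return one (run-time preference, structures, cpu seconds) tuple per DONE line."""
    runs = []
    pref = None
    for line in output_text.splitlines():
        m = PREF_RE.search(line)
        if m:
            pref = int(m.group(1))  # remember the most recent preference
            continue
        m = DONE_RE.search(line)
        if m and pref is not None:
            runs.append((pref, int(m.group(1)), float(m.group(2))))
    return runs

# Excerpt copied from the post above.
output_text = """# cpu_run_time_pref: 14400
DONE :: 49 starting structures 14250.2 cpu seconds
This process generated 49 decoys from 49 attempts
# cpu_run_time_pref: 14400
DONE :: 1 starting structures 14501.5 cpu seconds
This process generated 1 decoys from 1 attempts
"""

runs = summarize_runs(output_text)
for pref, structures, cpu_seconds in runs:
    print(pref, structures, cpu_seconds)
```

Here the first summary reports 49 structures just under the 14400 second preference, and the second reports a single structure at 14501.5 seconds, which matches the description of the task finishing and then doing one more model.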
____________


Alan J Rodger

Joined: Oct 16 05
Posts: 7
ID: 4998
Credit: 32,282
RAC: 0
Message 70632 - Posted 24 Jun 2011 15:11:07 UTC

Why don't you stop sending out Minirosetta 3.14 work units until you solve the problem? The other units seem to work.
____________

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70633 - Posted 24 Jun 2011 16:50:01 UTC - in response to Message ID 70632.

Possibly because the Minirosetta 3.14 workunits work on some computers. For example, my laptop. Possibly because they want to gather more information on what kinds of computers those workunits don't work on.

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 70634 - Posted 24 Jun 2011 18:44:58 UTC

Task 430561413 failed with an Out of Memory message


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x758B9617

Engaging BOINC Windows Runtime Debugger...

(much debugging stuff snipped)

Odd considering it's a C2D with 4 GB of memory and these tasks don't seem to use more than 300-400 MB.

I've also been having a lot of these 'task hanging' issues on W7 (not Mac) : they're curable by quitting BOINC and restarting. Haven't been keeping track of all the names but tasks with names like ilv* seem particularly prone to this behaviour.

[AF>france>pas-de-calais]symaski62

Joined: Sep 19 05
Posts: 47
ID: 506
Credit: 33,871
RAC: 0
Message 70635 - Posted 24 Jun 2011 23:11:28 UTC

25/06/2011 01:02:23 | rosetta@home | Started download of minirosetta_3.14_windows_intelx86.exe
25/06/2011 01:02:23 | rosetta@home | Started download of minirosetta_graphics_3.13_windows_intelx86.exe

3.14 version ?



____________

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70641 - Posted 26 Jun 2011 2:48:54 UTC - in response to Message ID 70635.
Last modified: 26 Jun 2011 2:49:25 UTC

Yes, the 3.14 version of the main application program. Looks behind on the screensaver graphics program, though.

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 70643 - Posted 26 Jun 2011 5:59:44 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=430776403
casd_sgr145_boinc_3lccA_26.nonlocal.pctid_0.20.tmscore_0.67557._nonlocal_tex_IGNORE_THE_REST_27533_2536_1

Outcome Client error
Client state Compute error
Exit status -529697949 (0xe06d7363)
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7C812AFB

This task chewed up 3.24GB of RAM?
That's insane.

darkestkhan

Joined: Nov 16 09
Posts: 2
ID: 358645
Credit: 4,886
RAC: 0
Message 70645 - Posted 26 Jun 2011 8:19:13 UTC

Debian GNU/Linux Sid/Experimental, BOINC 6.10.56
In the middle of night I got:

*** glibc detected *** ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.14_x86_64-pc-linux-gnu: double free or corruption (!prev): 0x1719e408 ***
======= Backtrace: =========
[0xa449b81]
[0xa44d69b]
[0xa411111]
[0x817a794]
[0xa427a5d]
[0xa38b0ca]
[0xa38b50a]
[0xf77b9400]
[0x80501d0]
[0xa45bafc]
[0x817b9ff]
[0x8049480]
[0xa4602de]
======= Memory map: ========
08048000-0a999000 r-xp 00000000 fe:02 3260875 /home/darkestkhan/BOINC/projects/boinc.bakerlab.org_rosetta/minirosetta_3.14_x86_64-pc-linux-gnu
0a999000-0a9a0000 rwxp 02950000 fe:02 3260875 /home/darkestkhan/BOINC/projects/boinc.bakerlab.org_rosetta/minirosetta_3.14_x86_64-pc-linux-gnu
0a9a0000-0ab5c000 rwxp 00000000 00:00 0
0bbe4000-1793c000 rwxp 00000000 00:00 0 [heap]
ef900000-ef9ae000 rwxp 00000000 00:00 0
ef9ae000-efa00000 ---p 00000000 00:00 0
efa6a000-efa6b000 ---p 00000000 00:00 0
efa6b000-f0f56000 rwxp 00000000 00:00 0
f111a000-f627a000 rwxp 00000000 00:00 0
f627a000-f758e000 rwxs 00000000 fe:02 1081610 /home/darkestkhan/BOINC/slots/0/boinc_minirosetta_0
f758e000-f758f000 ---p 00000000 00:00 0
f758f000-f7592000 rwxp 00000000 00:00 0
f7592000-f7594000 rwxs 00000000 fe:02 1081606 /home/darkestkhan/BOINC/slots/0/boinc_mmap_file
f7594000-f77b9000 rwxp 00000000 00:00 0
f77b9000-f77ba000 r-xp 00000000 00:00 0 [vdso]
ff9db000-ff9fc000 rw-p 00000000 00:00 0 [stack]

cnick6

Joined: May 30 06
Posts: 24
ID: 85398
Credit: 5,296,712
RAC: 6,626
Message 70650 - Posted 27 Jun 2011 14:21:06 UTC

A couple of 3.14 client crashes:

http://boinc.bakerlab.org/rosetta/result.php?resultid=431970915
http://boinc.bakerlab.org/rosetta/result.php?resultid=431639167

____________

Christoph

Joined: Dec 10 05
Posts: 57
ID: 33998
Credit: 1,509,578
RAC: 0
Message 70656 - Posted 28 Jun 2011 13:32:52 UTC

With the three stuck workunits I have at the moment, I can confirm that they seem to get stuck right after a checkpoint, maybe even at the checkpoint itself.

The three workunits are:
http://boinc.bakerlab.org/rosetta/result.php?resultid=432501911, last checkpoint: 8:38:03, cpu time: 8:38:04
http://boinc.bakerlab.org/rosetta/result.php?resultid=432531601, last checkpoint: 8:25:02, cpu time: 8:25:03
http://boinc.bakerlab.org/rosetta/result.php?resultid=432552233, last checkpoint: 6:24:59, cpu time: 6:24:59

I used Process Explorer to take a look at the threads and call stacks of one of those and posted them here, hope they are somewhat helpful.

Then I noticed that one of the threads is in suspended state, manually resumed it and the WU continued fine without problems. This has worked for all three workunits that I've tested. Looks to me like the worker thread isn't resumed after a checkpoint was done.
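The "stuck right after a checkpoint" pattern is easy to see by subtracting the two times. A small Python sketch, using the three result IDs and times listed above (the helper name `to_seconds` is just for this example):

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# (result id, last checkpoint, cpu time) as posted above.
stuck = [
    ("432501911", "8:38:03", "8:38:04"),
    ("432531601", "8:25:02", "8:25:03"),
    ("432552233", "6:24:59", "6:24:59"),
]

gaps = [to_seconds(cpu) - to_seconds(checkpoint) for _, checkpoint, cpu in stuck]
for (result_id, _, _), gap in zip(stuck, gaps):
    print(result_id, "CPU seconds since last checkpoint:", gap)
```

All three tasks used at most one second of CPU time after their last checkpoint, which is consistent with the worker thread not being resumed once the checkpoint was written.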

Tex1954

Joined: Apr 3 11
Posts: 9
ID: 415829
Credit: 2,607,582
RAC: 0
Message 70669 - Posted 1 Jul 2011 12:35:22 UTC

I'm getting a lot of computational errors lately, here's 3 in a row!

433392690 395543410 1 Jul 2011 9:29:55 UTC 1 Jul 2011 12:37:37 UTC Over Client error Compute error 592.18 5.51 ---
433361641 395530375 1 Jul 2011 6:18:15 UTC 1 Jul 2011 12:20:55 UTC Over Client error Compute error 519.42 4.84 ---
433327434 395499534 1 Jul 2011 2:02:12 UTC 1 Jul 2011 12:09:45 UTC Over Client error Compute error 612.05 5.70 ---

Sheesh...

:D

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 70671 - Posted 1 Jul 2011 18:31:25 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=395439475

I was the wingman on this and this also died from a C++ error dealing with memory.

first person gets: process exited with code 193 (0xc1, -63) and terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::domain_error> >'
what(): Error in function boost::math::normal_distribution<d>::normal_distribution: Location parameter is nan, but must be finite!
SIGABRT: abort called

I get: - exit code -529697949 (0xe06d7363)
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7C812AFB
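
The exit codes in the two reports above are each printed in more than one form (decimal, hexadecimal, signed). A minimal Python sketch of how those forms relate, assuming the usual two's-complement reading; the helper names are illustrative and not part of BOINC:

```python
def as_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF

def as_signed8(code: int) -> int:
    """Reinterpret the low byte of an exit status as a signed 8-bit value."""
    low = code & 0xFF
    return low - 0x100 if low >= 0x80 else low

# Windows report: the negative decimal and the hex value are the same 32 bits
print(hex(as_unsigned32(-529697949)))   # 0xe06d7363

# First report: 193 is 0xc1, which reads as -63 when treated as a signed byte
print(hex(193), as_signed8(193))        # 0xc1 -63
```

The value 0xe06d7363 is the code the Microsoft C++ runtime uses for a thrown C++ exception, which matches the "(C++ Exception)" label in the dump.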

cnick6

Joined: May 30 06
Posts: 24
ID: 85398
Credit: 5,296,712
RAC: 6,626
Message 70672 - Posted 2 Jul 2011 0:27:51 UTC

Win64 client crash:

32.4: kd:x86> kp
ChildEBP RetAddr
01d8e18c 00411ef0 KERNELBASE!DebugBreak+0x2
01d8e1b0 00401d3f minirosetta_3_14_windows_x86_64!memcpy_s(void * dst = 0x00000000`1cf50020, unsigned int sizeInBytes = 0xfd0f2ef, void * src = 0x00000000`00000000, unsigned int count = 0xfd0f2e7)+0x2b [f:\sp\vctools\crt_bld\self_x86\crt\src\memcpy_s.c @ 55]
01d8e1d4 00ae1128 minirosetta_3_14_windows_x86_64!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign(class std::basic_string<char,std::char_traits<char>,std::allocator<char> > * _Right = <Memory access error>, unsigned int _Roff = <Memory access error>, unsigned int _Count = <Memory access error>)+0xbf [c:\program files (x86)\microsoft visual studio 8\vc\include\xstring @ 1049]
01d8e1f0 00ae1165 minirosetta_3_14_windows_x86_64!core::chemical::name_from_aa(core::chemical::AA aa = 0n0 (No matching enumerant))+0x68 [d:\boinc_build\minirosetta_beta_3.14\mini\src\core\chemical\aa.cc @ 253]
01d8e22c 00d5b843 minirosetta_3_14_windows_x86_64!core::chemical::operator<<(class std::basic_ostream<char,std::char_traits<char> > * os = 0x00000000`010d0820, core::chemical::AA * aa = 0x00000000`00000000)+0x35 [d:\boinc_build\minirosetta_beta_3.14\mini\src\core\chemical\aa.cc @ 245]
01d8e2c0 00d5bcb5 minirosetta_3_14_windows_x86_64!core::fragment::make_pose_from_sequence_(class std::basic_string<char,std::char_traits<char>,std::allocator<char> > sequence = class std::basic_string<char,std::char_traits<char>,std::allocator<char> >, class core::chemical::ResidueTypeSet * residue_set = 0x00000000`00000000, class core::pose::Pose * pose = 0x00000000`167f1548)+0x113 [d:\boinc_build\minirosetta_beta_3.14\mini\src\core\fragment\frame.cc @ 68]
01d8e2f4 0085445f minirosetta_3_14_windows_x86_64!core::fragment::Frame::fragment_as_pose(unsigned int frag_num = 0, class core::pose::Pose * pose = 0x00000000`00000000, class utility::pointer::access_ptr<core::chemical::ResidueTypeSet const > restype_set = class utility::pointer::access_ptr<core::chemical::ResidueTypeSet const >)+0x35 [d:\boinc_build\minirosetta_beta_3.14\mini\src\core\fragment\frame.cc @ 402]
01d8e328 00854795 minirosetta_3_14_windows_x86_64!protocols::basic_moves::GunnCost::compute_gunn(class core::fragment::Frame * frame = 0x00000000`00000000, unsigned int frag_num = 0, struct protocols::basic_moves::GunnTuple * data = 0x00000000`6e6e7547)+0xff [d:\boinc_build\minirosetta_beta_3.14\mini\src\protocols\basic_moves\gunncost.cc @ 95]
01d8e478 00a50530 minirosetta_3_14_windows_x86_64!protocols::basic_moves::GunnCost::score(class core::fragment::Frame * frame = 0x00000000`167cd0f8, class core::pose::Pose * pose = 0x00000000`01d8eb18, class utility::vector1<double,std::allocator<double> > * scores = 0x00000000`01d8e4b0)+0x2c5 [d:\boinc_build\minirosetta_beta_3.14\mini\src\protocols\basic_moves\gunncost.cc @ 77]
01d8e4ec 0094b91f minirosetta_3_14_windows_x86_64!protocols::nonlocal::SmoothPolicy::choose(class core::fragment::Frame * frame = 0x00000000`167cd0f8, class core::pose::Pose * pose = 0x00000000`01d8eb18)+0x60 [d:\boinc_build\minirosetta_beta_3.14\mini\src\protocols\nonlocal\smoothpolicy.cc @ 50]
01d8e7f0 009b556c minirosetta_3_14_windows_x86_64!protocols::nonlocal::SingleFragmentMover::apply(class core::pose::Pose * pose = 0x00000000`00000000)+0x15f [d:\boinc_build\minirosetta_beta_3.14\mini\src\protocols\nonlocal\singlefragmentmover.cc @ 116]
01d8e97c 00403cc2 minirosetta_3_14_windows_x86_64!protocols::nonlocal::RationalMonteCarlo::apply(class core::pose::Pose * pose = 0x00000000`01d8eb18)+0x7c [d:\boinc_build\minirosetta_beta_3.14\mini\src\protocols\nonlocal\rationalmontecarlo.cc @ 62]
01d8e9b4 009b51ec minirosetta_3_14_windows_x86_64!std::basic_ostream<char,std::char_traits<char> >::put(char _Ch = <Memory access error>)+0x102 [c:\program files (x86)\microsoft visual studio 8\vc\include\ostream @ 528]
01d8e9f0 00675f4c minirosetta_3_14_windows_x86_64!protocols::nonlocal::BrokenBase::apply(class core::pose::Pose * pose = <Memory access error>)+0x20c [d:\boinc_build\minirosetta_beta_3.14\mini\src\protocols\nonlocal\brokenbase.cc @ 68]
01d8eafc 005d137b minirosetta_3_14_windows_x86_64!protocols::nonlocal::NonlocalAbinitio::apply(class core::pose::Pose * pose = 0x00000000`01d8eb18)+0x45c [d:\boinc_build\minirosetta_beta_3.14\mini\src\protocols\nonlocal\nonlocalabinitio.cc @ 240]
01d8eca4 005d1cd4 minirosetta_3_14_windows_x86_64!protocols::jd2::JobDistributor::go_main(class utility::pointer::owning_ptr<protocols::moves::Mover> mover = class utility::pointer::owning_ptr<protocols::moves::Mover>)+0xa2b [d:\boinc_build\minirosetta_beta_3.14\mini\src\protocols\jd2\jobdistributor.cc @ 376]
01d8ecc4 00837b46 minirosetta_3_14_windows_x86_64!protocols::jd2::JobDistributor::go(class utility::pointer::owning_ptr<protocols::moves::Mover> mover = class utility::pointer::owning_ptr<protocols::moves::Mover>)+0x44 [d:\boinc_build\minirosetta_beta_3.14\mini\src\protocols\jd2\jobdistributor.cc @ 201]
01d8ecf8 005de323 minirosetta_3_14_windows_x86_64!protocols::jd2::BOINCJobDistributor::go(class utility::pointer::owning_ptr<protocols::moves::Mover> mover = class utility::pointer::owning_ptr<protocols::moves::Mover>)+0xb6 [d:\boinc_build\minirosetta_beta_3.14\mini\src\protocols\jd2\boincjobdistributor.cc @ 96]
01d8ed48 0040579d minirosetta_3_14_windows_x86_64!protocols::nonlocal::NonlocalAbinitio_main(void * __formal = 0x00000000`00000000)+0x223 [d:\boinc_build\minirosetta_beta_3.14\mini\src\protocols\nonlocal\nonlocalabinitiomain.cc @ 80]
01d8eedc 00405bf5 minirosetta_3_14_windows_x86_64!main(int argc = 0n25, char ** argv = 0x00000000`01d8eef4)+0xe1d [d:\boinc_build\minirosetta_beta_3.14\mini\src\apps\public\boinc\minirosetta.cc @ 220]
01d8fef0 004186b7 minirosetta_3_14_windows_x86_64!WinMain(struct HINSTANCE__ * hInst = 0x00000000`76ca33ca, struct HINSTANCE__ * hPrevInst = 0x00000000`7efde000, char * Args = 0x00000000`01d8ffd4 "???", int WinMode = 0n2010029778)+0x25 [d:\boinc_build\minirosetta_beta_3.14\mini\src\apps\public\boinc\minirosetta.cc @ 292]
01d8ff88 76ca33ca minirosetta_3_14_windows_x86_64!__tmainCRTStartup(void)+0x177 [f:\sp\vctools\crt_bld\self_x86\crt\src\crt0.c @ 324]
01d8ff94 77ce9ed2 kernel32!BaseThreadInitThunk+0xe
01d8ffd4 77ce9ea5 ntdll_77cb0000!__RtlUserThreadStart+0x70
01d8ffec 00000000 ntdll_77cb0000!_RtlUserThreadStart+0x1b

32.4: kd:x86> u minirosetta_3_14_windows_x86_64!memcpy_s minirosetta_3_14_windows_x86_64!memcpy_s+0x2b
minirosetta_3_14_windows_x86_64!memcpy_s [f:\sp\vctools\crt_bld\self_x86\crt\src\memcpy_s.c @ 47]:
00000000`00411ec5 55 push ebp
00000000`00411ec6 8bec mov ebp,esp
00000000`00411ec8 56 push esi
00000000`00411ec9 8b7514 mov esi,dword ptr [ebp+14h]
00000000`00411ecc 57 push edi
00000000`00411ecd 33ff xor edi,edi
00000000`00411ecf 3bf7 cmp esi,edi
00000000`00411ed1 7504 jne minirosetta_3_14_windows_x86_64!memcpy_s+0x12 (00411ed7)
00000000`00411ed3 33c0 xor eax,eax
00000000`00411ed5 eb65 jmp minirosetta_3_14_windows_x86_64!memcpy_s+0x77 (00411f3c)
00000000`00411ed7 397d08 cmp dword ptr [ebp+8],edi
00000000`00411eda 751b jne minirosetta_3_14_windows_x86_64!memcpy_s+0x32 (00411ef7)
00000000`00411edc e81a150000 call minirosetta_3_14_windows_x86_64!_errno (004133fb)
00000000`00411ee1 6a16 push 16h
00000000`00411ee3 5e pop esi
00000000`00411ee4 8930 mov dword ptr [eax],esi
00000000`00411ee6 57 push edi
00000000`00411ee7 57 push edi
00000000`00411ee8 57 push edi
00000000`00411ee9 57 push edi
00000000`00411eea 57 push edi
00000000`00411eeb e84c210000 call minirosetta_3_14_windows_x86_64!_invalid_parameter (0041403c)


I have a mini kernel dump if anyone needs it.
____________

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 70673 - Posted 2 Jul 2011 5:59:42 UTC - in response to Message ID 70672.

Please don't paste such a long dump; just post a link to the task, either as a URL or in plain text. Or do as I do: summarize and post a few of the most relevant pieces. Long text dumps like this clog the thread. The team can follow the link for further information.

Chilean Profile

Joined: Oct 16 05
Posts: 651
ID: 5008
Credit: 10,394,263
RAC: 2,476
Message 70674 - Posted 2 Jul 2011 6:24:12 UTC

I'll post a few WUs that have failed:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=393616699

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=394975566

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=390946185

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=395576024


Hope it helps.
____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 70680 - Posted 2 Jul 2011 22:06:18 UTC

This one failed after 13 min.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=395597786

ggc_boinc_rosetta_cm_nonlocal_sounier_IGNORE_THE_REST_28252_8806_1

Part of result log.

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>


# cpu_run_time_pref: 14400
cannot find aminoacid SIGSEGV: segmentation violation
Stack trace (20 frames):
[0xa38f2d7]
[0xf77f1400]
[0xa3f14cf]
[0xa05a145]
[0x92e2b97]
[0x92e2da7]
[0x8137d29]
[0x8138bce]
[0x9131838]
[0x912e0e6]
[0x9127976]
[0x9284619]
[0x8abb3dc]
[0x815fa96]
[0x8161a79]
[0x915ffa5]
[0x81041f8]
[0x8054421]
[0xa41f118]
[0x8048131]

Exiting...

</stderr_txt>
]]>

____________


ComfortablyNumb

Joined: Jul 6 07
Posts: 8
ID: 188768
Credit: 658,196
RAC: 0
Message 70685 - Posted 4 Jul 2011 16:24:22 UTC

Nothing but computation errors lately. Is anybody else seeing this? I had reached the maximum number of results (92) at noon. Mini 3.14

Chilean Profile

Joined: Oct 16 05
Posts: 651
ID: 5008
Credit: 10,394,263
RAC: 2,476
Message 70687 - Posted 4 Jul 2011 19:46:17 UTC - in response to Message ID 70685.

Nothing but computation errors lately. Is anybody else seeing this? I had reached the maximum number of results (92) at noon. Mini 3.14


That's really odd. None of my PCs have any errors, except for one or two every 30-40 WUs... Is your PC running stable? Any overclocking?

If you overclock, run this benchmark: http://www.xtremesystems.org/forums/showthread.php?201670-LinX-A-simple-Linpack-interface

Set the number of runs to 20 and wait. If there is no error, then your PC is rock stable and it's a WU problem. If your PC is even a bit unstable, LinX should pick it up.
____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 70701 - Posted 9 Jul 2011 5:51:41 UTC

Hi.

Mine seems to have finished OK, but I got no credit for it; see what happened to the other two below. Any chance it can be fixed so that I get the credit for it?


casd_rhodopsin_boinc_1l0mA_53.abrelax_cs_frags.pctid_0.25.tmscore_0.66390._abrelax_cs_frags_tex_IGNORE_THE_REST_27887_8592

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=394828271

Sent | Reported or deadline | Server state | Outcome | Client state | CPU time (s)
27 Jun 2011 20:41:20 UTC | 7 Jul 2011 20:41:20 UTC | Over | No reply | New | 0.00
7 Jul 2011 20:49:24 UTC | 7 Jul 2011 21:59:42 UTC | Over | Client error | Compute error | 0.00
Mine: 7 Jul 2011 22:06:17 UTC | 9 Jul 2011 5:37:38 UTC | Over | Validate error | Done | 13,837.73

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<stderr_txt>
[2011- 7- 8 22:53:55:] :: BOINC:: Initializing ... ok.
[2011- 7- 8 22:53:55:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.



# cpu_run_time_pref: 14400
Continuing computation from checkpoint: chk_S_00003_FragmentSampler__stage1 ... success!
Continuing computation from checkpoint: chk_S_00003_FragmentSampler__stage2 ... success!
Continuing computation from checkpoint: chk_S_00003_FragmentSampler__stage_3_iter1_1 ... success!
Continuing computation from checkpoint: chk_S_00003_FragmentSampler__stage_3_iter1_2 ... success!
Continuing computation from checkpoint: chk_S_00003_FragmentSampler__stage_3_iter1_3 ... success!
Continuing computation from checkpoint: chk_S_00003_FragmentSampler__stage_3_iter1_4 ... success!
Continuing computation from checkpoint: chk_S_00003_FragmentSampler__stage_3_iter1_5 ... success!
Continuing computation from checkpoint: chk_S_00003_FragmentSampler__stage_3_iter1_6 ... success!
======================================================
DONE :: 21 starting structures 13837.5 cpu seconds
This process generated 21 decoys from 21 attempts
======================================================
BOINC :: WS_max 5.1645e+120

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state: Invalid
Claimed credit: 102.49362635862
Granted credit: 0
Application version: 3.14

____________


Rabinovitch Profile

Joined: Apr 28 07
Posts: 28
ID: 170444
Credit: 1,483,610
RAC: 2,997
Message 70704 - Posted 10 Jul 2011 5:49:05 UTC
Last modified: 10 Jul 2011 5:50:03 UTC

Hi all!
I have just installed a new "incarnation" of Kubuntu 10.10 amd64, BOINC, and the x32-libs recommended by boinc.berkeley.edu. All my CPU tasks (Rosetta, RALPH and QMC) exit with "Compute error" after several hours of processing. The Rosetta application reports different things, for example:

Task 435109856

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
[2011- 7- 9 18:58: 0:] :: BOINC:: Initializing ... ok.
[2011- 7- 9 18:58: 0:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev42272.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/ECH19_looprem_verif_long404.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 86400
FILE_LOCK::unlock(): close failed.: Bad file descriptor
[2011- 7- 9 19:27:26:] :: BOINC:: Initializing ... ok.
[2011- 7- 9 19:27:26:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev42272.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/ECH19_looprem_verif_long404.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 86400

</stderr_txt>
]]>

Task 435105624

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Input file minirosetta_database_rev42272.zip missing or invalid: -120
</message>
]]>

For QMC tasks there is always a "process got signal 11" message.

What could the matter be? Einstein's CUDA WUs are working well enough.
____________
From Siberia with love!

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 70705 - Posted 10 Jul 2011 15:30:51 UTC
Last modified: 10 Jul 2011 15:33:10 UTC

Looks like it's having a problem unzipping the minirosetta_database_rev42272.zip file mentioned. You might just download it to a sandbox and see if you can unzip it from the command line. If not, perhaps there is now something inconsistent about your setup that is interfering with the unzip. Here's a direct link to download that specific file outside the BOINC client (just use wget), so you can tinker with it.

http://boinc.bakerlab.org/rosetta/download/minirosetta_database_rev42272.zip

Going through the steps manually might also uncover any network/firewall or antivirus issues that may be affecting things.
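
If unzipping from the command line is inconvenient, the same integrity check can be sketched in Python with the standard zipfile module. This assumes the file has already been downloaded into the current directory; the function name check_zip is only illustrative.

```python
import zipfile

def check_zip(path: str) -> str:
    """Return 'ok', or a short description of what is wrong with the archive."""
    if not zipfile.is_zipfile(path):
        return "not a zip archive (missing, truncated, or wrong format)"
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and returns the first one
        # with a bad CRC or header, or None if all members are fine
        bad_member = archive.testzip()
    return "ok" if bad_member is None else f"corrupt member: {bad_member}"

if __name__ == "__main__":
    print(check_zip("minirosetta_database_rev42272.zip"))
```

If this reports "ok" for a fresh download but the copy in the BOINC project directory fails, the problem is more likely on the local side (disk, permissions, or an interrupted transfer) than with the file on the server.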
____________
Rosetta Moderator: Mod.Sense

ecafkid Profile

Joined: Oct 5 05
Posts: 40
ID: 2748
Credit: 15,177,319
RAC: 0
Message 70708 - Posted 11 Jul 2011 12:03:04 UTC

I have added a new machine to my account and have several days' worth of WUs. My question: the WUs start processing, and at some point (generally when they are above 50% complete) they change to "waiting to run" while other WUs start up and say "running, high priority". What would make this happen?
____________

robertmiles Profile

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70709 - Posted 11 Jul 2011 13:47:56 UTC - in response to Message ID 70708.

I have added a new machine to my account and have several days' worth of WUs. My question: the WUs start processing, and at some point (generally when they are above 50% complete) they change to "waiting to run" while other WUs start up and say "running, high priority". What would make this happen?


This is a typical result of connecting to a BOINC project that overestimates how many workunits a new computer can complete by the deadline.

ecafkid Profile

Joined: Oct 5 05
Posts: 40
ID: 2748
Credit: 15,177,319
RAC: 0
Message 70710 - Posted 11 Jul 2011 15:14:26 UTC - in response to Message ID 70709.

I have added a new machine to my account and have several days' worth of WUs. My question: the WUs start processing, and at some point (generally when they are above 50% complete) they change to "waiting to run" while other WUs start up and say "running, high priority". What would make this happen?


This is a typical result of connecting to a BOINC project that overestimates how many workunits a new computer can complete by the deadline.



Thanks
____________

robertmiles Profile

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70711 - Posted 11 Jul 2011 16:44:26 UTC
Last modified: 11 Jul 2011 16:46:53 UTC

You're welcome.

BOINC often corrects that problem after it has completed enough workunits from that project to make a better estimate of how long each workunit should run on that computer.
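
The deadline pressure described in the replies above can be illustrated with a small Python sketch. This is a simplified earliest-deadline-first check and not BOINC's actual scheduler; the Task fields, the function name, and the example numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    remaining_hours: float   # estimated CPU time still needed
    deadline_hours: float    # time left until the report deadline

def tasks_at_risk(tasks, cores=1):
    """Return names of tasks that would miss their deadline if the queue
    were worked through in earliest-deadline-first order.

    Dividing by `cores` is a crude approximation of running in parallel.
    """
    at_risk = []
    elapsed = 0.0
    for task in sorted(tasks, key=lambda t: t.deadline_hours):
        elapsed += task.remaining_hours / cores
        if elapsed > task.deadline_hours:
            at_risk.append(task.name)
    return at_risk

# Hypothetical queue: three 24-hour tasks, two of them due in 60 hours
queue = [Task("a", 24, 48), Task("b", 24, 60), Task("c", 24, 60)]
print(tasks_at_risk(queue, cores=1))
print(tasks_at_risk(queue, cores=2))
```

When a check of this kind finds tasks at risk, a client has reason to pause whatever is running and start the more urgent work first, which is the "waiting to run" / "running, high priority" behaviour asked about. Once the run-time estimates improve, less work is fetched and the queue fits within the deadlines again.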

dgnuff Profile

Joined: Nov 1 05
Posts: 347
ID: 8170
Credit: 23,349,222
RAC: 5,850
Message 70726 - Posted 15 Jul 2011 8:50:01 UTC
Last modified: 15 Jul 2011 8:51:24 UTC

Like several people in here, I'm getting the occasional hang with the new 3.14 client. Hopefully I'll be able to come back and edit this to add further crashed WUs, since I get about one a day from the farm I have here.

To get things started ...

ilv_fgf2_all_boinc_1n4kA_124.nonlocal.pctid_0.19.tmscore_0.62808._nonlocal_tex_IGNORE_THE_REST_27534_15684_0

-- Edit -- missed a period in the WU name.
____________

Jim Strait

Joined: Dec 10 05
Posts: 2
ID: 33962
Credit: 1,318,612
RAC: 686
Message 70727 - Posted 15 Jul 2011 11:36:52 UTC

Recently I have been getting a number of pop-ups that say:
===========================================
Microsoft Visual C++ Runtime Library

Runtime Error!

Program: ...kerlab.org_rosetta\minirosetta_3.14_windows_intelx86.exe

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
===========================================

I am running Microsoft Windows XP Professional x86 Edition, Service Pack 3 (05.01.2600.00),
on a GenuineIntel Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz [x86 Family 6 Model 26 Stepping 5].

____________

Alan J Rodger

Joined: Oct 16 05
Posts: 7
ID: 4998
Credit: 32,282
RAC: 0
Message 70731 - Posted 15 Jul 2011 16:19:04 UTC

Still getting hangups on Mini 3.14 - is there a plan to fix this or stop issuing 3.14 WUs?

____________

darkestkhan

Joined: Nov 16 09
Posts: 2
ID: 358645
Credit: 4,886
RAC: 0
Message 70732 - Posted 15 Jul 2011 23:01:46 UTC

I just got yet another SIGSEGV on an up-to-date Debian GNU/Linux sid/exp:

*** glibc detected *** ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.14_x86_64-pc-linux-gnu: double free or corruption (fasttop): 0x0c4ee8a8 ***
======= Backtrace: =========
[0xa449b81]
[0xa44d69b]
[0xa411111]
[0xa427a5d]
[0xa38b0ca]
[0xa38b50a]
[0xf776d400]
[0x80501d0]
[0xa45bafc]
[0x817b9ff]
[0x8049480]
[0xa4602de]
======= Memory map: ========
08048000-0a999000 r-xp 00000000 fe:02 3260875 /home/darkestkhan/BOINC/projects/boinc.bakerlab.org_rosetta/minirosetta_3.14_x86_64-pc-linux-gnu
0a999000-0a9a0000 rwxp 02950000 fe:02 3260875 /home/darkestkhan/BOINC/projects/boinc.bakerlab.org_rosetta/minirosetta_3.14_x86_64-pc-linux-gnu
0a9a0000-0ab5c000 rwxp 00000000 00:00 0
0c402000-17cf0000 rwxp 00000000 00:00 0 [heap]
ef900000-ef980000 rwxp 00000000 00:00 0
ef980000-efa00000 ---p 00000000 00:00 0
efa85000-f070a000 rwxp 00000000 00:00 0
f08ce000-f5a2e000 rwxp 00000000 00:00 0
f5a2e000-f5a2f000 ---p 00000000 00:00 0
f5a2f000-f622e000 rwxp 00000000 00:00 0
f622e000-f7542000 rwxs 00000000 fe:02 1089644 /home/darkestkhan/BOINC/slots/0/boinc_minirosetta_0
f7542000-f7543000 ---p 00000000 00:00 0
f7543000-f7546000 rwxp 00000000 00:00 0
f7546000-f7548000 rwxs 00000000 fe:02 1089639 /home/darkestkhan/BOINC/slots/0/boinc_mmap_file
f7548000-f776d000 rwxp 00000000 00:00 0
f776d000-f776e000 r-xp 00000000 00:00 0 [vdso]
ff97c000-ff9bc000 rw-p 00000000 00:00 0 [stack]

____________

Snagletooth

Joined: Feb 22 07
Posts: 193
ID: 149031
Credit: 1,425,415
RAC: 236
Message 70733 - Posted 15 Jul 2011 23:45:48 UTC

This has probably been noticed already, as it failed on both crunchers, but in case it's a Mac-only problem (and thus somewhat less likely to be spotted)...

ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_18042_1


ERROR: Cannot open PDB file "2ilaA.pdb"
ERROR:: Exit from: src/core/import_pose/import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Best,
Snags

[AF>france>pas-de-calais]symaski62

Joined: Sep 19 05
Posts: 47
ID: 506
Credit: 33,871
RAC: 0
Message 70762 - Posted 21 Jul 2011 22:40:25 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=437733634


Exit status -1073741819 (0xc0000005)

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00B26465 read attempt to address 0x00000008

Engaging BOINC Windows Runtime Debugger...
____________

Old man

Joined: Nov 10 07
Posts: 22
ID: 219792
Credit: 470,314
RAC: 0
Message 70791 - Posted 26 Jul 2011 14:29:48 UTC

Task ID 438352266
Name T610_bn_rs_stg0_lrlxMultiCst_t000__casp9__aln1_SAVE_ALL_OUT_29826_110_0
Workunit 400124231
Created 24 Jul 2011 19:47:54 UTC
Sent 24 Jul 2011 19:55:52 UTC
Received 26 Jul 2011 14:33:04 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 1443602
Report deadline 3 Aug 2011 19:55:52 UTC
CPU time 31199.31

stderr out

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
[2011- 7-26 5:28:19:] :: BOINC:: Initializing ... ok.
[2011- 7-26 5:28:19:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev42272.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/T610_bn_rs_stg0_lrlxMultiCst_t000__casp9.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 86400


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x3FF00000 read attempt to address 0x3FF00000

Engaging BOINC Windows Runtime Debugger...



********************


BOINC Windows Runtime Debugger Version 6.5.0


Dump Timestamp : 07/26/11 15:06:02
Install Directory : C:\Program Files\BOINC\
Data Directory : C:\Documents and Settings\All Users\Application Data\BOINC\alphatesti
Project Symstore :
Loaded Library : C:\Program Files\BOINC\\dbghelp.dll
Loaded Library : C:\Program Files\BOINC\\symsrv.dll
Loaded Library : C:\Program Files\BOINC\\srcsrv.dll
LoadLibraryA( C:\Program Files\BOINC\\version.dll ): GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0
Symbol Search Path: C:\Documents and Settings\All Users\Application Data\BOINC\alphatesti\slots\0;C:\Documents and Settings\All Users\Application Data\BOINC\alphatesti\projects\boinc.bakerlab.org_rosetta;srv*C:\Documents and Settings\All Users\Application Data\BOINC\alphatesti\projects\boinc.bakerlab.org_rosettasymbols*http://msdl.microsoft.com/download/symbols;srv*C:\Documents and Settings\All Users\Application Data\BOINC\alphatesti\projects\boinc.bakerlab.org_rosettasymbols*http://boinc.berkeley.edu/symstore


ModLoad: 00400000 00ffd000 C:\Documents and Settings\All Users\Application Data\BOINC\alphatesti\projects\boinc.bakerlab.org_rosetta\minirosetta_3.14_windows_intelx86.exe (-exported- Symbols Loaded)
Linked PDB Filename : D:\boinc_build\minirosetta_beta_3.14\mini\ide\VisualStudio\BoincRelease\minirosetta_beta_3.14_windows_intelx86.pdb

ModLoad: 7c900000 000b2000 C:\WINDOWS\system32\ntdll.dll (5.1.2600.6055) (PDB Symbols Loaded)
Linked PDB Filename : ntdll.pdb
File Version : 5.1.2600.6055 (xpsp_sp3_gdr.101209-1647)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.6055

ModLoad: 7c800000 000f6000 C:\WINDOWS\system32\kernel32.dll (5.1.2600.5781) (PDB Symbols Loaded)
Linked PDB Filename : kernel32.pdb
File Version : 5.1.2600.5781 (xpsp_sp3_gdr.090321-1317)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5781

ModLoad: 7e410000 00091000 C:\WINDOWS\system32\USER32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : user32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 77f10000 00049000 C:\WINDOWS\system32\GDI32.dll (5.1.2600.5698) (PDB Symbols Loaded)
Linked PDB Filename : gdi32.pdb
File Version : 5.1.2600.5698 (xpsp_sp3_gdr.081022-1932)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5698

ModLoad: 77dd0000 0009b000 C:\WINDOWS\system32\ADVAPI32.dll (5.1.2600.5755) (PDB Symbols Loaded)
Linked PDB Filename : advapi32.pdb
File Version : 5.1.2600.5755 (xpsp_sp3_gdr.090206-1234)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5755

ModLoad: 77e70000 00093000 C:\WINDOWS\system32\RPCRT4.dll (5.1.2600.6022) (PDB Symbols Loaded)
Linked PDB Filename : rpcrt4.pdb
File Version : 5.1.2600.6022 (xpsp_sp3_gdr.100813-1643)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.6022

ModLoad: 77fe0000 00011000 C:\WINDOWS\system32\Secur32.dll (5.1.2600.5834) (PDB Symbols Loaded)
Linked PDB Filename : secur32.pdb
File Version : 5.1.2600.5834 (xpsp_sp3_gdr.090624-1305)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5834

ModLoad: 76390000 0001d000 C:\WINDOWS\system32\IMM32.DLL (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : imm32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 77690000 00021000 C:\WINDOWS\system32\NTMARTA.DLL (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : ntmarta.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 77c10000 00058000 C:\WINDOWS\system32\msvcrt.dll (7.0.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : msvcrt.pdb
File Version : 7.0.2600.5512 (xpsp.080413-2111)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 7.0.2600.5512

ModLoad: 774e0000 0013e000 C:\WINDOWS\system32\ole32.dll (5.1.2600.6010) (PDB Symbols Loaded)
Linked PDB Filename : ole32.pdb
File Version : 5.1.2600.6010 (xpsp_sp3_gdr.100712-1633)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.6010

ModLoad: 71bf0000 00013000 C:\WINDOWS\system32\SAMLIB.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : samlib.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 76f60000 0002c000 C:\WINDOWS\system32\WLDAP32.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : wldap32.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2113)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512

ModLoad: 12950000 00115000 C:\Program Files\BOINC\dbghelp.dll (6.8.4.0) (PDB Symbols Loaded)
Linked PDB Filename : dbghelp.pdb
File Version : 6.8.0004.0 (debuggers(dbg).070515-1751)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.8.0004.0

ModLoad: 134e0000 00048000 C:\Program Files\BOINC\symsrv.dll (6.8.4.0) (PDB Symbols Loaded)
Linked PDB Filename : symsrv.pdb
File Version : 6.8.0004.0 (debuggers(dbg).070515-1751)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.8.0004.0

ModLoad: 13530000 0003b000 C:\Program Files\BOINC\srcsrv.dll (6.8.4.0) (PDB Symbols Loaded)
Linked PDB Filename : srcsrv.pdb
File Version : 6.8.0004.0 (debuggers(dbg).070515-1751)
Company Name : Microsoft Corporation
Product Name : Debugging Tools for Windows(R)
Product Version : 6.8.0004.0

ModLoad: 77c00000 00008000 C:\WINDOWS\system32\version.dll (5.1.2600.5512) (PDB Symbols Loaded)
Linked PDB Filename : version.pdb
File Version : 5.1.2600.5512 (xpsp.080413-2105)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 5.1.2600.5512



*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 20052, Write: 0, Other 200421

- I/O Transfers Counters -
Read: 0, Write: 346596, Other 0

- Paged Pool Usage -
QuotaPagedPoolUsage: 102500, QuotaPeakPagedPoolUsage: 102500
QuotaNonPagedPoolUsage: 3392, QuotaPeakNonPagedPoolUsage: 4520

- Virtual Memory Usage -
VirtualSize: 468652032, PeakVirtualSize: 495939584

- Pagefile Usage -
PagefileUsage: 243654656, PeakPagefileUsage: 274866176

- Working Set Size -
WorkingSetSize: 251064320, PeakWorkingSetSize: 282066944, PageFaultCount: 44854399

*** Dump of thread ID 2736 (state: Waiting): ***

- Information -
Status: Wait Reason: UserRequest, , Kernel Time: 1461874944.000000, User Time: 310538764288.000000, Wait Time: 8795556.000000

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x3FF00000 read attempt to address 0x3FF00000

- Registers -
eax=143daa00 ebx=0ad6a830 ecx=0a337ac0 edx=00000060 esi=00000060 edi=0ad800c0
eip=3ff00000 esp=01d8957c ebp=00000000
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246

- Callstack -
ChildEBP RetAddr Args to Child
01d89578 a881c0c9 01d8eb80 14057e00 1833e438 fffffffe !+0x0 SymFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '3ff00000'
00000000 00000000 00000000 00000000 00000000 00000000 !+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = 'a881c0c9'

*** Dump of thread ID 2788 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 468750.000000, User Time: 625000.000000, Wait Time: 8795557.000000

- Registers -
eax=03cffb44 ebx=00000000 ecx=a9595549 edx=e0000000 esi=00000000 edi=03cfff70
eip=7c90e514 esp=03cfff40 ebp=03cfff98
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202

- Callstack -
ChildEBP RetAddr Args to Child
03cfff3c 7c90d21a 7c8023f1 00000000 03cfff70 00000000 ntdll!_KiFastSystemCallRet@0+0x0 FPO: [0,0,0]
03cfff40 7c8023f1 00000000 03cfff70 00000000 7c802446 ntdll!_NtDelayExecution@8+0x0 FPO: [2,0,0]
03cfff98 7c802455 00000064 00000000 03cfffec 004088db kernel32!_SleepEx@8+0x0
03cfffa8 004088db 00000064 00000000 7c80b729 00000000 kernel32!_Sleep@4+0x0
03cfffec 00000000 004088d0 00000000 00000000 6a510000 minirosetta_3.14_windows_intelx!+0x0

*** Dump of thread ID 2280 (state: Waiting): ***

- Information -
Status: Wait Reason: ExecutionDelay, , Kernel Time: 0.000000, User Time: 0.000000, Wait Time: 8795497.000000

- Registers -
eax=0d56fe28 ebx=0a4adb01 ecx=0d56e734 edx=0000439e esi=00000000 edi=0d56fdf8
eip=7c90e514 esp=0d56fdc8 ebp=0d56fe20
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202

- Callstack -
ChildEBP RetAddr Args to Child
0d56fdc4 7c90d21a 7c8023f1 00000000 0d56fdf8 000000d2 ntdll!_KiFastSystemCallRet@0+0x0 FPO: [0,0,0]
0d56fdc8 7c8023f1 00000000 0d56fdf8 000000d2 00015180 ntdll!_NtDelayExecution@8+0x0 FPO: [2,0,0]
0d56fe20 7c802455 000007d0 00000000 00003840 0060c4d2 kernel32!_SleepEx@8+0x0
0d56fe30 0060c4d2 000007d0 a40fab09 ffffffff 0a4adbf8 kernel32!_Sleep@4+0x0
0d56fe38 a40fab09 ffffffff 0a4adbf8 0d56ff6c 0a4adbf8 minirosetta_3.14_windows_intelx!cppdb::backend::driver::connect+0x0
0d56fe3c ffffffff 0a4adbf8 0d56ff6c 0a4adbf8 00000001 minirosetta_3.14_windows_intelx!+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = 'a40fab09'
0d56ff3c 7c917c51 7c917d08 7c800000 0d56ff7c 00000000 minirosetta_3.14_windows_intelx!+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = 'ffffffff'
0d56ffe0 7c80b72f 00000000 00000000 00000000 00414e52 ntdll!_LdrpGetProcedureAddress@20+0x0 SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '7c917c51'
0d56ffe4 00000000 00000000 00000000 00414e52 0a4adbf8 kernel32!_BaseThreadStart@8+0x0 FPO: [0,0,0] SymFromAddr(): GetLastError = '126' SymGetLineFromAddr(): GetLastError = '126' SymGetModuleInfo(): GetLastError = '126' Address = '7c80b72f'


*** Debug Message Dump ****


*** Foreground Window Data ***
Window Name :
Window Class :
Window Process ID: 0
Window Thread ID : 0

Exiting...

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 177.651213902221
Granted credit 0
application version 3.14

Why did my task fail?
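
One observation that may help whoever looks at this dump, offered as a guess rather than a diagnosis: the faulting address 0x3FF00000 (which is also the value of eip) matches the upper 32 bits of the IEEE 754 double 1.0. That would be consistent with the program jumping through a pointer that had been overwritten with floating-point data, though the dump alone cannot confirm it. The bit pattern itself is easy to check:

```python
import struct

# Big-endian IEEE 754 encoding of the double-precision value 1.0
bits = struct.pack(">d", 1.0).hex()
high_word = bits[:8]  # upper 32 bits

print(bits)       # 3ff0000000000000
print(high_word)  # 3ff00000
```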

[VENETO] boboviz Profile

Joined: Dec 1 05
Posts: 556
ID: 25524
Credit: 1,559,760
RAC: 1,171
Message 70860 - Posted 2 Aug 2011 19:00:01 UTC

439628254

Tag::read - parse error, printing backtrace.

Tag::read - parse error - file:istream line:5 column:1 - </SFXN5>
Tag::read - parse error - file:istream line:5 column:1 - ^

Tag::read - parse error - file:istream line:6 column:1 - </SCOREFXNS>
Tag::read - parse error - file:istream line:6 column:1 - ^

Tag::read - parse error - file:istream line:9 column:1 - </FILTERS>
Tag::read - parse error - file:istream line:9 column:1 - ^

Tag::read - parse error - file:istream line:13 column:1 - </TASKOPERATIONS>
Tag::read - parse error - file:istream line:13 column:1 - ^

Tag::read - parse error - file:istream line:15 column:1 - <FlxbbDesign name=flxbb ncycles=2 sfxn_design=SFXN5 sfxn_relax=SFXN5 SFXN5 clear_all_residues=1 task_operations=limitchi2,layer_allclear_all_residues=0 blueprint="master.blueprint" constraints_NtoC=1.0 />
Tag::read - parse error - file:istream line:15 column:1 - ^

Tag::read - parse error - file:istream line:14 column:1 - <MOVERS>
Tag::read - parse error - file:istream line:14 column:1 - ^

Tag::read - parse error - file:istream line:1 column:1 - <dock_design>
Tag::read - parse error - file:istream line:1 column:1 - ^


ERROR: false
ERROR:: Exit from: ..\..\..\src\utility\tag\Tag.cc line: 387
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
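
Looking at the FlxbbDesign line quoted in the backtrace, two things stand out: a bare "SFXN5" with no key in front of it, and "layer_allclear_all_residues=0", which looks like two attributes run together. Whether either is what actually tripped Tag::read is an assumption; the sketch below is not Rosetta's parser, just a rough Python heuristic (function name made up) that flags any token without exactly one "=":

```python
import shlex

tag = ('<FlxbbDesign name=flxbb ncycles=2 sfxn_design=SFXN5 sfxn_relax=SFXN5 SFXN5 '
       'clear_all_residues=1 task_operations=limitchi2,layer_allclear_all_residues=0 '
       'blueprint="master.blueprint" constraints_NtoC=1.0 />')


def suspicious_tokens(tag_text: str) -> list[str]:
    """Return attribute tokens that do not look like a single key=value pair."""
    # Drop the angle brackets and the trailing slash of a self-closing tag.
    body = tag_text.strip().lstrip("<").rstrip(">").rstrip("/").strip()
    # shlex keeps quoted values together; the first token is the tag name.
    tokens = shlex.split(body)[1:]
    return [t for t in tokens if t.count("=") != 1]


for token in suspicious_tokens(tag):
    print(token)
```

On the quoted line this prints the bare SFXN5 and the fused task_operations token, which is where I would look first.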
____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 70896 - Posted 4 Aug 2011 7:48:05 UTC
Last modified: 4 Aug 2011 8:03:57 UTC

Hi.

These tasks are finishing after 16 minutes and 1 decoy. Is this what you wanted/expected?

I've had two do it so far, & counting.

flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_08_29965_6412_0

# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>

ps / Good to see some work. :)

Edit // Seems like they are all getting validate errors now. :(
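
For anyone who wants to compare many of these logs, the two summary lines can be pulled out with a couple of regular expressions. This is a minimal sketch in Python based only on the lines quoted in this post; the function and field names are made up, and real logs may use different spacing:

```python
import re

DONE_RE = re.compile(r"DONE :: +(\d+) starting structures +(\d+(?:\.\d+)?) cpu seconds")
DECOYS_RE = re.compile(r"This process generated +(\d+) decoys from +(\d+) attempts")


def parse_summary(stderr_text):
    """Extract the DONE summary from a stderr dump, or None if it is missing."""
    done = DONE_RE.search(stderr_text)
    decoys = DECOYS_RE.search(stderr_text)
    if not done or not decoys:
        return None
    return {
        "starting_structures": int(done.group(1)),
        "cpu_seconds": float(done.group(2)),
        "decoys": int(decoys.group(1)),
        "attempts": int(decoys.group(2)),
    }


sample = """\
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
"""

print(parse_summary(sample))
```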
____________


P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 70927 - Posted 6 Aug 2011 2:54:02 UTC
Last modified: 6 Aug 2011 2:56:20 UTC

Hi.

Is anyone looking at these tasks? I've got another 4 of the same, with the same problem I reported earlier.

Am I wasting my time reporting this?

All are getting validate errors.

flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_08_29965_8638_0

# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>
____________


Chris Holvenstot Profile

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 70928 - Posted 6 Aug 2011 7:05:38 UTC

I've also had a bunch of them - so far I count 27 of them - same basic task name. Same validate error. Same claim that watchdog nailed them after 1201 seconds - although in most cases the task list shows they only ran somewhere between 750 and 950 seconds.

They have all been sent out to someone else for a "second try" - I've been good all week so maybe fate will smile on me and they will end up on Sid's bucket.

robertmiles Profile

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70934 - Posted 6 Aug 2011 21:14:30 UTC - in response to Message ID 70896.
Last modified: 6 Aug 2011 21:19:38 UTC

Hi.

These tasks are finishing after 16 minutes and 1 decoy. Is this what you wanted/expected?

I've had two do it so far, & counting.

flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_08_29965_6412_0

# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 1201 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>

ps / Good to see some work. :)

Edit // Seems like they are all getting validate errors now. :(


For some of the previous workunits, decoy 1 was basically a test with an already known result, used to check whether your computer did the calculations properly. Could the validator still be assuming that a workunit which completed only one decoy cannot contain any new results?

For those with this type of problem, I'd suggest mentioning how long you have set BOINC to allow workunits to continue before switching to some other project's workunits, whether you have any other BOINC projects enabled so it will actually try this switching, and whether you have BOINC set to keep workunits in memory when they are suspended. Also, you could mention which version of BOINC you're using in case it's one of those that did not initialize certain variables used in calculating how long workunits should be allowed to run.


On another subject, both of my computers usually gave minirosetta 3.14 workunits not completing properly (see earlier in this thread if you want details). Therefore, I've set Rosetta@Home to No New Tasks for a few weeks now, while waiting for a new minirosetta version likely to have this problem fixed. When is a new version likely? I haven't been seeing one tested on RALPH@Home yet.

Snagletooth

Joined: Feb 22 07
Posts: 193
ID: 149031
Credit: 1,425,415
RAC: 236
Message 70936 - Posted 6 Aug 2011 23:41:12 UTC
Last modified: 6 Aug 2011 23:42:48 UTC

This is not the first batch of work that behaved this way. I recall at least one other instance in which the workunit was ended after a single model was produced and, if memory serves, recorded a discrepancy in cpu time used.

I see 5 "flxdsgn" workunits in my tasks list:

1.flxdsgn_Ploop_2x3_no_sheet_constraint_11_29940_568_0
1 model completed, insignificant discrepancy in cpu time recorded, valid (3781.76, 3781.84)

2. flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_04_29965_5656_0
1 model completed, insignificant discrepancy in cpu time recorded, valid (1340.49, 1341.349)

3. flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_11_29965_5023_1
1 model completed, odd discrepancy, invalid(1201, 1172.478), 2nd copy invalid (1201, 825.2765)

4. flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_02_29965_7348_0
1 model, odd discrepancy, this copy invalid, (1201/ 1120.343), 2nd valid (2119.97 / 2920.167)

5. flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_03_29965_7238_1
1 model, insignificant discrepancy this copy valid (1249.71/1250.505), 1st copy invalid(1201/739.8815)

Notes on the cpu time:

The top section of the task details page is essentially identical for every project and includes a record of the cpu time used. The information included in the last section on this page, the stderr output, is unique to each project, and indeed can vary with each type of workunit within a project. It appears that each project decides what it wishes to record here and writes the code to include in the application. For rosetta workunits the cpu time used is recorded within the stderr out in addition to the standard location. The number in the stderr out is recorded alongside the number of models completed and must be recorded before the standard figure (common to all projects; let's call it the BOINC figure) reported further up the details page. Thus the number in the stderr output is always fractions of a second smaller than the BOINC number.

In the 3rd, 4th and 5th examples of flxdsgn workunits shown above, for all 4 invalid copies of the workunits, the cpu time used is recorded as 1201 in the stderr out. The BOINC recording is always considerably less, from 28.522 seconds less to 461.1185 seconds less.
Is it this, the BOINC time being less than the rosetta time, which causes the validator to mark the workunit invalid?
Or is it a discrepancy over a certain amount which triggers an invalidation?
Where does the 1201 come from?
Is it a default number for when the application has lost track of the time?
Is it an intentional signal designed to alert the project to a particular event within the workunit? (If x happens record the cpu time as 1201). Is the consequent invalidation with additional workunit creation needed or not?
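
The range of differences quoted above can be reproduced from the four invalid copies in the list. A quick check in Python, using only the figures already given in this post (the variable names are mine):

```python
# (stderr "DONE" seconds, BOINC-reported seconds) for the four invalid copies
invalid_copies = [
    (1201, 1172.478),   # workunit 3, this copy
    (1201, 825.2765),   # workunit 3, 2nd copy
    (1201, 1120.343),   # workunit 4, this copy
    (1201, 739.8815),   # workunit 5, 1st copy
]

# How far the BOINC figure falls short of the 1201 in the stderr output
gaps = [round(stderr_s - boinc_s, 4) for stderr_s, boinc_s in invalid_copies]

print(min(gaps), max(gaps))  # 28.522 461.1185
```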

I could go on but I'll spare you.

It may well be that, from the project's perspective, this batch of workunits has proceeded as expected and behaved as designed, efficiently and without waste of resources. But from our perspective things look off. We would be a lot more helpful in notifying the project of problems (as we've been asked and are clearly willing to do) if the project kept us better informed about the behaviour we should expect to see.


Best,
Snags

edited to add:
Robert suggests a reason for designing a workunit to produce a single model but not the odd cpu time or the invalidation. In case he's onto something: my switch interval is 270 minutes, preferred run time is 12 hours, always running workunits(of various lengths) from several projects even when using only one core, workunits are kept in memory, BOINC 6.10.21 for Mac.

robertmiles Profile

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70937 - Posted 7 Aug 2011 1:38:54 UTC - in response to Message ID 70936.
Last modified: 7 Aug 2011 1:57:27 UTC

edited to add:
Robert suggests a reason for designing a workunit to produce a single model but not the odd cpu time or the invalidation. In case he's onto something: my switch interval is 270 minutes, preferred run time is 12 hours, always running workunits(of various lengths) from several projects even when using only one core, workunits are kept in memory, BOINC 6.10.21 for Mac.


An idea to check: If a wingman is validated, check how many decoys the wingman produced.

Edited: A failed idea. All the workunits of that type whose log files I was able to look at produced only one decoy, regardless of whether they validated or not.

Snagletooth

Joined: Feb 22 07
Posts: 193
ID: 149031
Credit: 1,425,415
RAC: 236
Message 70938 - Posted 7 Aug 2011 2:11:54 UTC - in response to Message ID 70937.
Last modified: 7 Aug 2011 2:14:56 UTC

edited to add:
Robert suggests a reason for designing a workunit to produce a single model but not the odd cpu time or the invalidation. In case he's onto something: my switch interval is 270 minutes, preferred run time is 12 hours, always running workunits(of various lengths) from several projects even when using only one core, workunits are kept in memory, BOINC 6.10.21 for Mac.


An idea to check: If a wingman is validated, check how many decoys the wingman produced.


In every instance of the flxdsgn workunits I listed above only one model was produced. The single model appears to be a feature of this type of workunit and unrelated to the 1201 issue or the validation errors.

Perhaps model doesn't mean the same thing for this type of workunit as it does for most (all?) other rosetta workunits. What if the application is ended not by time limits or model number limits but by something else, the occurrence of some other event? The model number would then be created for credit granting purposes only.


Best,
Snags

edited to say: Hey, Robert, just saw your edit : )

robertmiles Profile

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70939 - Posted 7 Aug 2011 2:55:25 UTC
Last modified: 7 Aug 2011 3:18:11 UTC

I've now looked at output files for this type of workunit for several users that made their outputs public enough that I could.

Each output file seems to show two different CPU times - one for the entire workunit, and a different one for decoy 1 alone. The validator appears to have accepted all cases where both were greater than 1201 seconds, and rejected all cases where the time reported for decoy 1 was 1201 seconds and the total CPU time for the entire workunit was less than that. I did not find any other cases for flxdsgn_Ploop workunits, even after looking at dozens of log files.

Therefore, this looks likely to be a case where the time reported on the Done :: line is often 1201 seconds, but never less, even if this value is incorrect. It appears that the developers need to add debugging for the calculations of this value for any future workunits of this type. Also, they should check whether the value on this line should have been ignored by the validator if it happens to be 1201, and if so, update the validator so it will - and also run the updated validator on all the validation failed workunits of this type to see if more credit should be awarded to those computers.
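
To make the pattern concrete, here it is as a small Python check. This is only my reading of the logs, not the project's validator; the function name is made up, and the figures are the eight copies Snags listed earlier in this thread:

```python
REPORTED_FLOOR = 1201  # value seen on the DONE :: line of every rejected copy


def predicted_invalid(done_line_seconds, total_cpu_seconds):
    """Guess: a copy is rejected when the DONE :: line says 1201 seconds
    but the total CPU time for the workunit is less than that."""
    return done_line_seconds == REPORTED_FLOOR and total_cpu_seconds < REPORTED_FLOOR


# (DONE :: line seconds, total CPU seconds, marked invalid?)
observations = [
    (3781.76, 3781.84, False),
    (1340.49, 1341.349, False),
    (1201, 1172.478, True),
    (1201, 825.2765, True),
    (1201, 1120.343, True),
    (2119.97, 2920.167, False),
    (1249.71, 1250.505, False),
    (1201, 739.8815, True),
]

mismatches = [o for o in observations if predicted_invalid(o[0], o[1]) != o[2]]
print(len(mismatches))  # 0
```

Agreement on eight copies does not prove the rule, of course; it only shows that none of the listed results contradicts it.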


I saw no significant differences in this for which CPU or which OS was used, or even which BOINC version was used.

I also noted that the values of WS_max shown a few lines below the Done :: line look very random, but saw no particular pattern of whether this affects validation.

Chris Holvenstot Profile

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 70940 - Posted 7 Aug 2011 4:04:25 UTC

BINGO Robert - good eye.

I just screened about 100 of these tasks on two of my hosts (129350 & 1300412) and in each and every case your hypothesis was correct. Not only in the case of my hosts, but also for those of my "wingmen" whose systems are fast enough to complete the decoy in less than 1201 seconds.

I did not spot a single case where a system got a clean validation when the decoy was produced in less than 1201 seconds.

So I guess the bad news is that I have a few hundred of these tasks on the four hosts I currently have dedicated to Rosetta. The good news is they blow through pretty quick.

Now I guess it's time to see if I can spot these in the queue and abort them before they start with some sort of cron script.
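
A rough sketch of the "spot them in the queue" half, in Python. It assumes the task list has been captured as text (for example from `boinccmd --get_tasks`) and that each task shows up on a line of the form "name: ..."; that layout is an assumption, so check it against your own client's output. The function name and sample text are made up, and nothing here aborts anything:

```python
PREFIX = "flxdsgn_Ploop"


def matching_task_names(get_tasks_output: str, prefix: str = PREFIX) -> list[str]:
    """Return task names from a task listing that start with the given prefix."""
    names = []
    for line in get_tasks_output.splitlines():
        key, sep, value = line.strip().partition(":")
        # Only exact "name" keys count, so "WU name" lines are skipped.
        if sep and key == "name" and value.strip().startswith(prefix):
            names.append(value.strip())
    return names


# Hypothetical listing, shaped the way the output is assumed to look
sample = """\
1) -----------
   name: flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_08_29965_6412_0
   WU name: flxdsgn_Ploop_2x3_run_4_SAVE_ALL_OUT_IGNORE_THE_REST_08_29965_6412
2) -----------
   name: T610_bn_rs_stg0_lrlxMultiCst_t000__casp9__aln1_SAVE_ALL_OUT_29826_110_0
"""

for name in matching_task_names(sample):
    print(name)
```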

Thanks

Ed

Joined: Aug 2 11
Posts: 31
ID: 425735
Credit: 441,759
RAC: 0
Message 70941 - Posted 7 Aug 2011 4:23:39 UTC
Last modified: 7 Aug 2011 4:24:58 UTC

Being new here I don't fully understand this conversation so I hope you don't mind my asking questions.

Is one of you guys developing the work units or are you all crunchers?

Assuming you are all crunchers, why would you abort work units like that, assuming they are not stuck? If the guys developing work units want to see how they run, aren't you working against them?

Sorry if these are dumb questions.

We seem to be out of Rosetta work so my system is just happy crunching away on Seti. I have two Astropulse WU running.

Chris Holvenstot Profile

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 70942 - Posted 7 Aug 2011 4:42:54 UTC

@ED - in my case I am not a developer for BOINC or Rosetta. My reason for being here is twofold. First, I believe the work being done by the Rosetta project is important. The second reason is that BOINC / Rosetta provides a nice solid testbed for the optimized Linux kernels I build in another life. The amount of credit granted by Rosetta is consistent enough that after running for a week to ten days I am able to evaluate the value of the optimizations I am testing.

Aborting a task is not a normal thing - I have never attempted to automate an abort in the past - however it is becoming clear that if your machine is fast enough to complete a flxdsgn_Ploop task in less than 1201 seconds you are not going to get a clean validation and the completed task will be sent to someone else for a second attempt.

It appears that there is a bug in these routines and if you have a fairly fast machine, you may be predestined to have the task complete with a validation error.

By the way welcome to the project - hope you enjoy associating with some weird (or diverse) folks.

Chris Holvenstot Profile

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 70943 - Posted 7 Aug 2011 4:53:01 UTC
Last modified: 7 Aug 2011 4:53:39 UTC

ED - one more thing - your comment about "working against the developers" - I understand exactly what you mean but in this case I think the developers probably have a thousand or more of these validate failures to evaluate.

Further, one of the real downsides to the Rosetta project is that there is almost no communication between the sysadmins and developers and those doing the crunching.

If the developer/scientist/student responsible for these tasks were to come out and state that they needed to look at a few more failure cases I would be pleased as punch to provide them.

However, the way it is around here there is a fairly good chance that the developer is not even aware of the issue, or he is aware, already has a fix in hand and is just letting the "broke" tasks flow through the system.

It is not likely you will ever see a post here from the project explaining what happened. Sad but true.

robertmiles Profile

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 70944 - Posted 7 Aug 2011 5:14:46 UTC - in response to Message ID 70941.
Last modified: 7 Aug 2011 5:29:19 UTC

Being new here I don't fully understand this conversation so I hope you don't mind my asking questions.

Is one of you guys developing the work units or are you all crunchers?

Assuming you are all crunchers, why would you abort work units like that, assuming they are not stuck? If the guys developing work units want to see how they run, aren't you working against them?

Sorry if these are dumb questions.

We seem to be out of Rosetta work so my system is just happy crunching away on Seti. I have two Astropulse WU running.


I'm no developer either, only a cruncher with crunching experience with most of the BOINC projects related to medical research. I do have years of experience fixing some types of software bugs, but never for BOINC projects. The best I can do seems to be pointing the developers to where to look for the bugs.

I've seen such problems with minirosetta 3.14 on my computers that I've set it to No New Tasks for now, and plan to leave it that way until some later version is available. I've already sent rather long posts on what I saw in my failed 3.14 workunits, and see no reason to try any more until the developers offer some response to those posts. RALPH@Home is still enabled on my desktop, though, since that's where they'll probably try the initial testing of any fixes.

If you look at the left column of this thread, you should see which users are labelled developers. Scroll to the top of the thread to see an example. Note that none of the developers have posted anything to this thread since this latest problem was reported, so let's hope that they're all busy trying to fix it, rather than, for example, all on vacation. Running out of workunits is a sign that SOMEONE at Rosetta@Home has recognized the problem, even if that someone was not a developer and cannot do much to fix the problem.

Snagletooth

Joined: Feb 22 07
Posts: 193
ID: 149031
Credit: 1,425,415
RAC: 236
Message 70950 - Posted 7 Aug 2011 14:39:39 UTC

The project runs a script, nightly I think, to grant credit for workunits that have failed validation. You won't find it on your tasks lists or workunits pages. You have to scroll to the bottom of the task details page to see it. All the invalid flxdsgn I've looked at have received credit (after a day or so).

Validate errors are not client errors and don't necessarily mean the workunits have failed and the results are useless. I see no reason to abort them. I would, though, very much like the project to chime in here and tell us what invalidation means in this particular case.

Best,
Snags, just another volunteer cruncher

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 70960 - Posted 7 Aug 2011 20:37:00 UTC

8 memory error tasks and 2 validate error tasks.
Getting old guys!

No tasks queued, so you're working on this?
Why not say something?

Ed

Joined: Aug 2 11
Posts: 31
ID: 425735
Credit: 441,759
RAC: 0
Message 70979 - Posted 8 Aug 2011 18:57:01 UTC

Thanks for the comments guys.

I am not currently getting any Rosetta work units. All my cpu time is going to Seti. At least that one runs consistently.

Sid Celery

Joined: Feb 11 08
Posts: 806
ID: 241409
Credit: 10,030,156
RAC: 9,347
Message 70984 - Posted 9 Aug 2011 2:46:20 UTC - in response to Message ID 70950.

I've been good all week so maybe fate will smile on me and they will end up on Sid's bucket.

Oi! Blooming cheek! ;)

The project runs a script, nightly I think, to grant credit for workunits that have failed validation. You won't find it on your tasks lists or workunits pages. You have to scroll to the bottom of the task details page to see it. All the invalid flxdsgn I've looked at have received credit (after a day or so).

Validate errors are not client errors and don't necessarily mean the workunits have failed and the results are useless. I see no reason to abort them. I would, though, very much like the project to chime in here and tell us what invalidation means in this particular case.

Beat me to it - agree. Not a problem for me. I don't need to be told why, as long as the project team see it and know why. It's not of any concern to me.
____________

Chris Holvenstot

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 70985 - Posted 9 Aug 2011 3:32:54 UTC

@Sid: Yes, I know about the script, and quite frankly I really don't give a rat's posterior about the credits - lost or gained.

My logic in aborting these tasks while they were still in the queue was based on efficiency. To me it made no sense to run a work unit that was predestined to generate a validation error just to have it routed to a wingman to be recomputed.

Why compute the same work units twice or even three times (if the wingman also had a fast processor)?

As far as being cheeky? I think that you are just showing your insecurity in the face of American exceptionalism. We may be the upstarts on the block, but we're catching up. For years you could proudly claim to have the world leader with the biggest ears.

But I think that Prince Charles has now been eclipsed by Obama.

Sid Celery

Joined: Feb 11 08
Posts: 806
ID: 241409
Credit: 10,030,156
RAC: 9,347
Message 70988 - Posted 9 Aug 2011 10:38:53 UTC - in response to Message ID 70985.

My logic in aborting these tasks while they were still in the queue was based on efficiency. To me it made no sense to run a work unit that was predestined to generate a validation error just to have it routed to a wingman to be recomputed.

I think there's some value in a job reporting a failure to run (presumably in a pattern already detected by crunchers), especially if it only runs 20 minutes, rather than reporting as aborted.

As far as being cheeky? I think that you are just showing your insecurity in the face of American exceptionalism. We may be the upstarts on the block, but we're catching up. For years you could proudly claim to have the world leader with the biggest ears.

But I think that Prince Charles has now been eclipsed by Obama.

lol ;) Where did you hear Charles was a leader of anything?
____________

Snagletooth

Joined: Feb 22 07
Posts: 193
ID: 149031
Credit: 1,425,415
RAC: 236
Message 70989 - Posted 9 Aug 2011 13:19:26 UTC - in response to Message ID 70985.


My logic in aborting these tasks while they were still in the queue was based on efficiency. To me it made no sense to run a work unit that was predestined to generate a validation error just to have it routed to a wingman to be recomputed.

Why compute the same work units twice or even three times (if the wingman also had a fast processor)?


I was under the impression, quite possibly inaccurate, that resends are not exact duplicates but are additional copies. The second cruncher is not in fact rerunning the exact same models of the first cruncher but rather running additional models. Perhaps Mod.Sense can clarify.

Further, if invalidation does not prevent interesting models from being more closely examined by the project scientists, then there's no reason not to continue running these types of tasks even in the face of frequent invalidation.

Although, one, frequent invalidation is annoying, they really should fix that; and two, I could be wrong on either or both points.

Best,
Snags

Ed

Joined: Aug 2 11
Posts: 31
ID: 425735
Credit: 441,759
RAC: 0
Message 70998 - Posted 10 Aug 2011 12:17:25 UTC

I would be interested in some information about resends. My understanding, frankly from Seti, and other grid projects I have been part of, is that each WU is sent to multiple computers. When the work comes back they are compared to validate the results.

The 3 way comparison is the one I seem to recall as being most common. If they send out three and two match, they are marked good and the third as not good.
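For illustration only, the three-way comparison described above could be sketched in Python roughly as follows. The function name, the use of plain strings as results, and the quorum default are all assumptions made for this example; none of it is taken from SETI or BOINC code.

```python
from collections import Counter


def validate_by_quorum(results, quorum=2):
    """results maps a host name to the result it returned (hypothetical format)."""
    if not results:
        return None, {}
    # Find the most common result and how many hosts returned it.
    value, votes = Counter(results.values()).most_common(1)[0]
    if votes < quorum:
        # No agreement yet; a project using this scheme would send another copy.
        return None, {}
    # Hosts matching the agreed value are marked good, the others not good.
    verdicts = {host: returned == value for host, returned in results.items()}
    return value, verdicts


agreed, verdicts = validate_by_quorum(
    {"host_a": "x1", "host_b": "x1", "host_c": "x2"}
)
print(agreed, verdicts)
```

With three copies where two match, the two matching hosts come back marked good and the third not good, which is the behaviour recalled above.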

I realize this is not a requirement, but would be interested to understand how this project works.

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 71002 - Posted 10 Aug 2011 13:54:04 UTC
Last modified: 10 Aug 2011 13:55:57 UTC

The last I knew, Rosetta@Home usually sent out only one of each workunit. If it came back with an error or was sent to a computer considered unreliable, they would send another. Also another for a small fraction of the computers considered reliable. If those two agreed, no need for a third one. If they disagreed, then a third one was sent. For most of the workunits, they had a fairly quick way of calculating how good the outputs were and could use that as part of deciding whether to send another copy. However, this was months ago, so it might not describe the current setup.

Snagletooth

Joined: Feb 22 07
Posts: 193
ID: 149031
Credit: 1,425,415
RAC: 236
Message 71008 - Posted 10 Aug 2011 22:33:43 UTC - in response to Message ID 71002.

The last I knew, Rosetta@Home usually sent out only one of each workunit. If it came back with an error or was sent to a computer considered unreliable, they would send another. Also another for a small fraction of the computers considered reliable. If those two agreed, no need for a third one. If they disagreed, then a third one was sent. For most of the workunits, they had a fairly quick way of calculating how good the outputs were and could use that as part of deciding whether to send another copy. However, this was months ago, so it might not describe the current setup.



Robert, I don't recall this ever being the procedure on rosetta. Perhaps you are thinking of malariacontrol.net? They use adaptive replication.


Ed, my speculations earlier in the thread regarding resends apply to rosetta@home only. If you go to your tasks list and click on the "workunit id" link you'll see that initial replication=1 and minimum quorum=1. Another copy will be sent only if the original is returned with an error, fails to validate or misses its deadline.

Some projects use multiple replications to prevent cheating or to discard results that produce the wrong answer but don't throw client errors. As I understand it, the method rosetta uses does not depend on finding the single right answer but on collecting best guesses. For each experiment the project sends out hundreds (thousands?) of workunits in order to create tens (hundreds?) of thousands of models which they can then analyze statistically. A single computer returning garbage should not affect the results. Likewise the failure of a single workunit or single models within workunits is not a cause for concern.
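As a rough sketch of the resend rule as I understand it (initial replication=1, minimum quorum=1, another copy only on error, validation failure or missed deadline), something like the Python below captures the idea. The enum and function names are made up for the example and are not BOINC server identifiers.

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels for how a task came back; not BOINC's own names.
    SUCCESS = "success"
    CLIENT_ERROR = "client_error"
    VALIDATE_ERROR = "validate_error"
    MISSED_DEADLINE = "missed_deadline"


INITIAL_REPLICATION = 1  # one copy of each workunit is sent out at first
MIN_QUORUM = 1           # one valid result is enough


def needs_resend(outcome):
    """Another copy is sent only when the single original did not succeed."""
    return outcome is not Outcome.SUCCESS


for outcome in Outcome:
    print(outcome.value, "->", "resend" if needs_resend(outcome) else "done")
```

The point of the sketch is only that nothing is sent twice for cross-checking; a second copy exists solely as a replacement for a failed first one.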


Best,
Snags

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 71012 - Posted 10 Aug 2011 23:07:41 UTC - in response to Message ID 71008.

The last I knew, Rosetta@Home usually sent out only one of each workunit. If it came back with an error or was sent to a computer considered unreliable, they would send another. Also another for a small fraction of the computers considered reliable. If those two agreed, no need for a third one. If they disagreed, then a third one was sent. For most of the workunits, they had a fairly quick way of calculating how good the outputs were and could use that as part of deciding whether to send another copy. However, this was months ago, so it might not describe the current setup.



Robert, I don't recall this ever being the procedure on rosetta. Perhaps you are thinking of malariacontrol.net? They use adaptive replication.


Ed, my speculations earlier in the thread regarding resends apply to rosetta@home only. If you go to your tasks list and click on the "workunit id" link you'll see that initial replication=1 and minimum quorum=1. Another copy will be sent only if the original is returned with an error, fails to validate or misses its deadline.

Some projects use multiple replications to prevent cheating or to discard results that produce the wrong answer but don't throw client errors. As I understand it, the method rosetta uses does not depend on finding the single right answer but on collecting best guesses. For each experiment the project sends out hundreds (thousands?) of workunits in order to create tens (hundreds?) of thousands of models which they can then analyze statistically. A single computer returning garbage should not affect the results. Likewise the failure of a single workunit or single models within workunits is not a cause for concern.


Best,
Snags


Possibly - I have my computers participating in most of the BOINC projects I've found connected to medical research, and it's often hard to keep track of which project is doing what.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 71016 - Posted 11 Aug 2011 0:16:31 UTC - in response to Message ID 70989.


My logic in aborting these tasks while they were still in the queue was based on efficiency. To me it made no sense to run a work unit that was predestined to generate a validation error just to have it routed to a wingman to be recomputed.

Why compute the same work units twice or even three times (if the wingman also had a fast processor)?


I was under the impression, quite possibly inaccurate, that resends are not exact duplicates but are additional copies. The second cruncher is not in fact rerunning the exact same models of the first cruncher but rather running additional models. Perhaps Mod.Sense can clarify.

Further, if invalidation does not prevent interesting models from being more closely examined by the project scientists, then there's no reason not to continue running these types of tasks even in the face of frequent invalidation.

Although, one, frequent invalidation is annoying, they really should fix that; and two, I could be wrong on either or both points.

Best,
Snags


When tasks are resent, the second person gets the same task... and the same random seed that defines which exact models to run, but the second machine may have a different runtime preference. So they will start out crunching exactly the same models, but may run more or fewer models than the first machine.
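A toy Python illustration of that point, with everything in it assumed for the example: a seeded generator stands in for the task's random seed, a 32-bit number stands in for a model, and a fixed time per model stands in for real crunching. It is not how minirosetta actually picks models.

```python
import random


def models_for_task(seed, runtime_budget_s, seconds_per_model):
    """Return the sequence of (stand-in) model IDs a machine would work through."""
    rng = random.Random(seed)  # same seed gives the same sequence on any machine
    models = []
    elapsed = 0
    # Keep starting models while another one still fits in the runtime budget.
    while elapsed + seconds_per_model <= runtime_budget_s:
        models.append(rng.getrandbits(32))
        elapsed += seconds_per_model
    return models


first = models_for_task(seed=12345, runtime_budget_s=3 * 3600, seconds_per_model=1800)
second = models_for_task(seed=12345, runtime_budget_s=6 * 3600, seconds_per_model=1800)

# The machine with the longer runtime preference repeats the same models,
# then carries on past the point where the first one stopped.
print(len(first), len(second), second[:len(first)] == first)
```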

____________
Rosetta Moderator: Mod.Sense

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 71017 - Posted 11 Aug 2011 0:19:53 UTC

Right, R@h just sends the task to a single machine, and only issues resends when that task is not returned before deadline, or is returned with an error.

R@h does not define "reliable hosts". Some other projects do.

Bottom line, rather than have one machine waste its time simply double checking the work of another, it crunches new models no one else has done. Net result, the project gets a wider sampling of the search space.
____________
Rosetta Moderator: Mod.Sense

Ed

Joined: Aug 2 11
Posts: 31
ID: 425735
Credit: 441,759
RAC: 0
Message 71020 - Posted 11 Aug 2011 3:07:28 UTC

Thanks for the clarifications.

Snagletooth

Joined: Feb 22 07
Posts: 193
ID: 149031
Credit: 1,425,415
RAC: 236
Message 71030 - Posted 11 Aug 2011 16:12:37 UTC - in response to Message ID 71016.


When tasks are resent, the second person gets the same task... and the same random seed that defines which exact models to run, but the second machine may have a different runtime preference. So they will start out crunching exactly the same models, but may run more or fewer models than the first machine.



So I was just crazy talking. Shucks, I really liked that idea. Ah, well.

Thanks, Mod.Sense, for clearing that up.

Best,
Snags

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 71031 - Posted 11 Aug 2011 19:44:48 UTC

Task 440910744 (T0423_3d01.pdb_boinc_symmetric_lr_symm_wangyr_IGNORE_THE_REST_30148_3650_0) gave a Validate error on Mac after completing one decoy.

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 71033 - Posted 12 Aug 2011 2:56:06 UTC
Last modified: 12 Aug 2011 2:57:34 UTC

This crashed and burned:T0409_3d0f.pdb_boinc_symmetric_lr_symm_wangyr_IGNORE_THE_REST_30145_9598_0

It ran only 50% of its allotted time and as far as I can tell produced no decoys.

Chris Holvenstot

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 71042 - Posted 13 Aug 2011 5:47:06 UTC

I am sure others have noted it already but tasks T0423* appear to be behaving in the exact same manner as the flxdsgn tasks of the past 10 days or so.

They are designed to generate only one decoy and if the system completes it in less than 1201 seconds it gets a validate error and is then sent to a second system.

Task ID 440948630 is an example where both I and my wingman completed the task in less than 1201 seconds and we both got a validate error.

Task ID 440943948 is an example where my system completed the task in less than 1201 seconds and got a validate error while my wingman took 3350 seconds and got a success.


Chris Holvenstot

Joined: May 2 10
Posts: 220
ID: 379129
Credit: 9,106,918
RAC: 0
Message 71043 - Posted 13 Aug 2011 5:55:54 UTC

I'm sorry, I think I need to editorialize a little bit. The T0423* tasks in my post were generated this past Thursday, a full week after the "1201 second" problem was spotted by another participant here.

Yet here we go again? Does anyone at the project read this forum? Better yet, does anyone at the project do anything to verify that a known problem is not propagated into a new batch of tasks before they are released into the wild?

While the cause of the problem behind the "1201 second" issue may be complex and as yet not identified, its signature is easy to spot - and could have been picked up in even the most rudimentary dry runs.

Dang!

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 71050 - Posted 13 Aug 2011 13:23:56 UTC

The validation has no way to determine a specific number of seconds that would be valid, or invalid. The same amount of effort from a slow machine might mean it takes twice as long to reach that same point of execution. So any such signature is a red herring.

The true problem would seem to lie elsewhere, in how the tasks are being processed. Which would explain why both machines that crunched it had the same problem.
____________
Rosetta Moderator: Mod.Sense

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 71066 - Posted 16 Aug 2011 12:47:37 UTC

Another workunit that's stopped using any CPU time:

minirosetta_3.14_windows_x86_64.exe
working set 618,000K
peak working set 651,888K

T0441_3d8u.pdb_boinc_symmetric_lr_symm_wangyr_IGNORE_THE_REST_29961_12936
max RAM usage 95 MB
CPU time at last checkpoint 02:23:47
CPU time 02:24:11
Elapsed time 28:40:13
Estimated time remaining 59:43:59
Fraction done 17.166%
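
One rough way to put a number on "stopped using any CPU time" is the ratio of CPU time to elapsed time. The Python below is only a sketch: the function names and the 0.5 cutoff are arbitrary choices for this example, and a low ratio can have other causes (throttling, a busy machine), so it is a hint rather than proof of a stall.

```python
def hms_to_seconds(text):
    """Convert an 'HH:MM:SS' string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time, elapsed_time):
    """Fraction of the elapsed time that was actually spent on the CPU."""
    elapsed = hms_to_seconds(elapsed_time)
    if elapsed == 0:
        return 0.0
    return hms_to_seconds(cpu_time) / elapsed


def looks_stalled(cpu_time, elapsed_time, threshold=0.5):
    # threshold is a made-up cutoff for this sketch, not a BOINC setting
    return cpu_efficiency(cpu_time, elapsed_time) < threshold


# Figures from the task above: CPU time 02:24:11, elapsed time 28:40:13
print(round(cpu_efficiency("02:24:11", "28:40:13"), 3))
print(looks_stalled("02:24:11", "28:40:13"))
```

For the figures above the ratio comes out at roughly 8%, far below what a task that is actually crunching would show.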

8/13/2011 2:07:36 PM | | Starting BOINC client version 6.12.33 for windows_x86_64
8/13/2011 2:07:36 PM | | log flags: file_xfer, sched_ops, task
8/13/2011 2:07:36 PM | | Libraries: libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.5
8/13/2011 2:07:36 PM | | Data directory: C:\ProgramData\BOINC
8/13/2011 2:07:36 PM | | Running under account Bobby
8/13/2011 2:07:36 PM | | Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]
8/13/2011 2:07:36 PM | | Processor: 6.00 MB cache
8/13/2011 2:07:36 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe
8/13/2011 2:07:36 PM | | OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
8/13/2011 2:07:36 PM | | Memory: 8.00 GB physical, 15.66 GB virtual
8/13/2011 2:07:36 PM | | Disk: 919.67 GB total, 541.21 GB free
8/13/2011 2:07:36 PM | | Local time is UTC -5 hours
8/13/2011 2:07:36 PM | | NVIDIA GPU 0: GeForce GTS 450 (driver version 28026, CUDA version 4000, compute capability 2.1, 993MB, 476 GFLOPS peak)

8/16/2011 12:17:04 AM | | Number of CPUs: 3
8/16/2011 12:17:04 AM | | 3026 floating point MIPS (Whetstone) per CPU
8/16/2011 12:17:04 AM | | 8778 integer MIPS (Dhrystone) per CPU
8/16/2011 12:17:05 AM | | Resuming computation
8/16/2011 2:04:43 AM | | Project communication failed: attempting access to reference site
8/16/2011 2:04:44 AM | | Internet access OK - project servers may be temporarily down.
8/16/2011 2:34:09 AM | | Project communication failed: attempting access to reference site
8/16/2011 2:34:10 AM | | Internet access OK - project servers may be temporarily down.
8/16/2011 2:48:39 AM | | Project communication failed: attempting access to reference site
8/16/2011 2:48:40 AM | | Internet access OK - project servers may be temporarily down.
8/16/2011 3:54:44 AM | rosetta@home | Sending scheduler request: To fetch work.
8/16/2011 3:54:44 AM | rosetta@home | Requesting new tasks for CPU
8/16/2011 3:54:45 AM | rosetta@home | Scheduler request completed: got 1 new tasks
8/16/2011 3:54:47 AM | rosetta@home | Started download of 2011_8_15_mini_s016_folding.zip
8/16/2011 3:54:54 AM | rosetta@home | Temporarily failed download of 2011_8_15_mini_s016_folding.zip: HTTP error
8/16/2011 3:54:55 AM | rosetta@home | Started download of 2011_8_15_mini_s016_folding.zip
8/16/2011 3:54:58 AM | | Project communication failed: attempting access to reference site
8/16/2011 3:54:59 AM | | Internet access OK - project servers may be temporarily down.
8/16/2011 3:55:09 AM | rosetta@home | Finished download of 2011_8_15_mini_s016_folding.zip
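
For anyone who wants to tally the failures in a log like the one above, here is a minimal Python sketch. It assumes only what is visible in the pasted lines, namely three fields separated by "|" (timestamp, project, message); the function names and the "(client)" label for lines with an empty project field are inventions for this example.

```python
from collections import Counter


def parse_log_line(line):
    """Split 'timestamp | project | message' into its three fields."""
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        return None
    return tuple(parts)


def count_failures(lines):
    """Count messages mentioning a failure, grouped by project."""
    counts = Counter()
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue
        _, project, message = parsed
        if "failed" in message.lower():
            # Lines with an empty project field come from the client itself.
            counts[project or "(client)"] += 1
    return counts


sample = [
    "8/16/2011 2:04:43 AM | | Project communication failed: attempting access to reference site",
    "8/16/2011 3:54:54 AM | rosetta@home | Temporarily failed download of 2011_8_15_mini_s016_folding.zip: HTTP error",
    "8/16/2011 3:55:09 AM | rosetta@home | Finished download of 2011_8_15_mini_s016_folding.zip",
]
print(count_failures(sample))
```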

Requested runtime 12 hours
BOINC 6.12.33
64-bit Windows Vista Home Premium SP2
8 GB memory; BOINC allowed to use 40% of it
Set to keep workunits in memory when suspended

Now suspended; should I allow it to resume? Should I abort it? Is it best to set R@H to no new tasks?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 71071 - Posted 17 Aug 2011 3:45:28 UTC

Robert, those are the odd ones. The watchdog can't get at them, because it doesn't get any CPU either. If you exit and restart BOINC, they will generally straighten themselves out. If 600MB was more than comfortable for your machine, then it does little harm to cancel one once in a while.
____________
Rosetta Moderator: Mod.Sense

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 71074 - Posted 17 Aug 2011 13:56:53 UTC
Last modified: 17 Aug 2011 14:03:24 UTC

Another 3.14 workunit that stopped using any CPU time.

http://boinc.bakerlab.org/rosetta/result.php?resultid=442045657

T0610_3ot2.pdb_boinc_lr_control_nativechainA_loopbuild_threading_cst_relax_wangyr_IGNORE_THE_REST_30423_909
Max RAM usage 95 MB
CPU time at last checkpoint 00:24:30
CPU time 00:24:33
Elapsed time 08:52:22
Estimated time remaining 26:52:02
Fraction done 3.325%
Virtual memory size 325.62 MB
Working set size 333.91 MB

Note the large difference between Max RAM usage and the Working set size.

Peak working set 341.920 MB
BOINC 6.12.33
64-bit Windows Vista Home Premium SP2
8 GB memory; BOINC allowed to use 40% of it
Leave applications in memory when suspended
Tthrottle64 V4.20 running, but only to display the temperatures

Already aborted, rather than wait for an answer.

Rosetta@Home is on No new tasks; probably will stay there until I see some signs on RALPH@Home that something is being done about this.

600 MB is reasonable on this computer; going many hours doing nothing useful is not.

Paul van Dijken

Joined: Jun 21 10
Posts: 1
ID: 384618
Credit: 403,619
RAC: 744
Message 71146 - Posted 26 Aug 2011 8:26:04 UTC

After running WU rb_08_23_25236_50085_rs_stg0_lrlxMultiCst_t000__casp9__aln2_SAVE_ALL_OUT_30565_13_0 for 13+ hours and no progress beyond 12.700%, I aborted it.
This was the 3rd time in a few days it happened.
I stopped downloading Rosetta.
Any estimate when this issue is going to be solved?

____________

Ed

Joined: Aug 2 11
Posts: 31
ID: 425735
Credit: 441,759
RAC: 0
Message 71173 - Posted 1 Sep 2011 3:32:35 UTC

I am having a different problem. My BOINC is set to run 65% Seti and 35% Rosetta, but it is constantly running Rosetta in priority mode. This has been going on for days. It is like every WU is coming down and immediately goes into priority.

I have had no SETI WU for days so Rosetta has been getting all the cycles. Now that Seti is sending out WU again I expect things to balance back out but it is not happening. Seti is getting no CPU time at all.

I have suspended Rosetta to give SETI some run time.

this better stop. Anyone have any suggestions?

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 71176 - Posted 1 Sep 2011 5:07:49 UTC - in response to Message ID 71173.
Last modified: 1 Sep 2011 5:09:29 UTC

I am having a different problem. My BOINC is set to run 65% Seti and 35% Rosetta, but it is constantly running Rosetta in priority mode. This has been going on for days. It is like every WU is coming down and immediately goes into priority.

I have had no SETI WU for days so Rosetta has been getting all the cycles. Now that Seti is sending out WU again I expect things to balance back out but it is not happening. Seti is getting no CPU time at all.

I have suspended Rosetta to give SETI some run time.

this better stop. Anyone have any suggestions?


Do the Rosetta workunits happen to have due dates before those for SETI?

Is the total expected time to run all the Rosetta workunits greater than 35% of the time to their due dates?

Have you tried setting both Rosetta and SETI on No New Tasks until close to finishing all the downloaded workunits, then unsetting this for SETI first and getting a few SETI workunits, then unsetting it for Rosetta as well?
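
To make the second question concrete, here is a back-of-the-envelope check in Python. It is a deliberate simplification (one core, one shared due date, a made-up function name) and not the simulation the BOINC client really runs when it decides on high priority.

```python
def at_risk_of_missing_deadline(remaining_hours, hours_until_deadline, resource_share):
    """True if the queued work needs more time than the project's share provides.

    remaining_hours: estimated hours left for each queued workunit
    hours_until_deadline: wall-clock hours until the (shared) due date
    resource_share: the project's share as a fraction, e.g. 0.35 for 35%
    """
    available = hours_until_deadline * resource_share
    return sum(remaining_hours) > available


# Three 12-hour Rosetta tasks due in 72 hours with a 35% share:
# 36 hours of work against about 25 hours of entitlement.
print(at_risk_of_missing_deadline([12, 12, 12], 72, 0.35))

# A single 12-hour task under the same conditions fits comfortably.
print(at_risk_of_missing_deadline([12], 72, 0.35))
```

If the first case describes your queue, the client has little choice but to run Rosetta ahead of its nominal share to get the work in on time.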

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 71180 - Posted 1 Sep 2011 23:15:44 UTC
Last modified: 1 Sep 2011 23:17:32 UTC

The thing I would try, is leaving it alone. The lack of work from SETI is causing the BOINC Manager to get work from Rosetta, but then when it looks at the 35% resource share, and the debt to SETI, it starts to worry about completing the tasks on time, so it sets them to run first (which is all that "high priority" means after all).

You shouldn't have to suspend projects and micro-manage things to get the resource share you have selected... when work is available. When work is not available, it gets work from where it can... and makes it up to the other project when it starts producing work again.
____________
Rosetta Moderator: Mod.Sense

Ed

Joined: Aug 2 11
Posts: 31
ID: 425735
Credit: 441,759
RAC: 0
Message 71182 - Posted 2 Sep 2011 20:48:44 UTC

Thanks guys!

I am going to leave it alone, but I have set Rosetta to "no new WU". When it runs dry, Seti will get its time again.

It could be that, during the time when Rosetta had no work and Seti was getting all the time, a "debt" was built up and it is now being worked off.

Who knows, but I thank you all for your analysis and recommendations.

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 71183 - Posted 3 Sep 2011 6:44:57 UTC - in response to Message ID 71182.
Last modified: 3 Sep 2011 6:50:21 UTC

Thanks guys!

I am going to leave it alone, but I have set Rosetta to "no new WU". When it runs dry, Seti will get its time again.

It could be that, during the time when Rosetta had no work and Seti was getting all the time, a "debt" was built up and it is now being worked off.

Who knows, but I thank you all for your analysis and recommendations.



by setting no new work you will create a debt again, and the next time you start the project you will get an overload of rosetta work and seti will shut down until the debt is settled.

best thing to do is to set your percentage of rosetta much lower than seti.
then the debt issue will not be a factor and rosetta work will take a back seat to seti until seti dries up again.

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 71184 - Posted 3 Sep 2011 9:58:33 UTC - in response to Message ID 71183.
Last modified: 3 Sep 2011 10:37:14 UTC

Thanks guys!

I am going to leave it alone, but I have set Rosetta to "no new WU". When it runs dry, Seti will get its time again.

It could be that, during the time when Rosetta had no work and Seti was getting all the time, a "debt" was built up and it is now being worked off.

Who knows, but I thank you all for your analysis and recommendations.



by setting no new work you will create a debt again, and the next time you start the project you will get an overload of rosetta work and seti will shut down until the debt is settled.

best thing to do is to set your percentage of rosetta much lower than seti.
then the debt issue will not be a factor and rosetta work will take a back seat to seti until seti dries up again.


I've tried something similar, and found that if you set some BOINC project to such a low percentage that giving it only that percentage will not allow all the workunits you have already downloaded from that project to complete on time, at least one of those workunits will almost immediately go into high priority mode. Shortening the queue of already downloaded workunits, if appropriate, before making any such change, works better.

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 71185 - Posted 3 Sep 2011 13:05:14 UTC - in response to Message ID 71184.
Last modified: 3 Sep 2011 13:05:53 UTC

Thanks guys!

I am going to leave it alone, but I have set Rosetta to "no new WU". When it runs dry, Seti will get its time again.

It could be that, during the time when Rosetta had no work and Seti was getting all the time, a "debt" was built up and it is now being worked off.

Who knows, but I thank you all for your analysis and recommendations.



by setting no new work you will create a debt again, and the next time you start the project you will get an overload of rosetta work and seti will shut down until the debt is settled.

best thing to do is to set your percentage of rosetta much lower than seti.
then the debt issue will not be a factor and rosetta work will take a back seat to seti until seti dries up again.



so then what would happen if he set both projects to no new work, let the tasks clear out, redo his percentages and extra days to what he thinks will work and then allow new work to come in? This way he could start clean and let Boinc Mgr figure out what to do based on the new parameters.
I've tried something similar, and found that if you set some BOINC project to such a low percentage that giving it only that percentage will not allow all the workunits you have already downloaded from that project to complete on time, at least one of those workunits will almost immediately go into high priority mode. Shortening the queue of already downloaded workunits, if appropriate, before making any such change, works better.

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 71187 - Posted 3 Sep 2011 23:23:38 UTC - in response to Message ID 71185.

Thanks guys!

I am going to leave it alone, but I have set Rosetta to "no new WU". When it runs dry, Seti will get its time again.

It could be that, during the time when Rosetta had no work and Seti was getting all the time, a "debt" was built up and it is now being worked off.

Who knows, but I thank you all for your analysis and recommendations.



by setting no new work you will create a debt again, and the next time you start the project you will get an overload of rosetta work and seti will shut down until the debt is settled.

best thing to do is to set your percentage of rosetta much lower than seti.
then the debt issue will not be a factor and rosetta work will take a back seat to seti until seti dries up again.



so then what would happen if he set both projects to no new work, let the tasks clear out, redo his percentages and extra days to what he thinks will work and then allow new work to come in? This way he could start clean and let Boinc Mgr figure out what to do based on the new parameters.


I've tried that also. Can start with an imbalance in the workunits, with the first project that asks for workunits getting more than its share. Generally not as bad an imbalance, though.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 71189 - Posted 4 Sep 2011 20:09:37 UTC

The best way for debt to balance out, is to leave it alone. Now that SETI has work coming, the BOINC Manager will figure out what it needs to do to balance the debt and deliver the resource shares you have selected.

As you say, there may have been a debt owed to Rosetta. If the two started equal, and SETI work dried up, you get a larger than average pile of Rosetta work. Then SETI comes back with work, and BOINC will figure out that it needs to both complete the tasks it has from Rosetta, and begin getting more from SETI than Rosetta to achieve the desired resource share.

All of the adjusting of resource shares, flagging as no new work, etc. is simply making it impossible for BOINC to figure out what you want.
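
For the curious, the general idea of debt can be shown with a toy model in Python. To be clear, this is not the BOINC client's actual algorithm: the function name, the one-hour steps and the rule "run whichever project is owed the most" are assumptions chosen to keep the example small.

```python
def simulate_shares(shares, availability, hours_per_step=1.0):
    """Toy debt model: each step, run the runnable project that is owed the most."""
    total_share = sum(shares.values())
    debt = {project: 0.0 for project in shares}
    history = []
    for has_work in availability:
        runnable = [p for p in shares if has_work[p]]
        chosen = max(runnable, key=lambda p: debt[p]) if runnable else None
        for project in shares:
            # Every project is entitled to its share of the step; only the
            # chosen one actually uses the CPU, so the others build up debt.
            entitled = hours_per_step * shares[project] / total_share
            used = hours_per_step if project == chosen else 0.0
            debt[project] += entitled - used
        history.append(chosen)
    return history, debt


shares = {"seti": 65, "rosetta": 35}
outage = [{"seti": False, "rosetta": True}] * 10      # SETI has no work
recovered = [{"seti": True, "rosetta": True}] * 20    # SETI work returns
history, debt = simulate_shares(shares, outage + recovered)
print(history)
```

In this toy run Rosetta gets every hour during the outage, then SETI runs exclusively for a while to collect what it is owed before the two start alternating, all without anyone touching the settings.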
____________
Rosetta Moderator: Mod.Sense

Ed

Joined: Aug 2 11
Posts: 31
ID: 425735
Credit: 441,759
RAC: 0
Message 71203 - Posted 6 Sep 2011 15:59:13 UTC
Last modified: 6 Sep 2011 15:59:34 UTC

Looks like BOINC has finally balanced out as I am now getting a more normal distribution of processing time between the two projects.

Thanks for the recommendations.

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 71264 - Posted 15 Sep 2011 16:55:49 UTC

Task 448806358 (T590_cc_rs_stg0_lrlxMultiCst_t000__casp9__aln1_SAVE_ALL_OUT_31304_125_0) failed on Mac

ERROR: seqpos >=1 && seqpos <= size()
ERROR:: Exit from: src/core/conformation/Conformation.hh line: 268
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 71306 - Posted 21 Sep 2011 22:32:22 UTC

A couple of tasks called test_needle* failed in the middle of computation under W7 with the same error message:

ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: ..\..\..\src\core\pose\symmetry\util.cc line: 740
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

The tasks were 449325696 and 449325673
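
For anyone wondering what that first ERROR line means: it is the text of a failed assertion. Judging only by the variable name coordsys_rot (an assumption, I have not looked at util.cc), it appears to require that a rotation matrix has a determinant within 1e-6 of 1. The sketch below shows that kind of check in plain Python; the helper names and the sample matrices are hypothetical and this is not Rosetta's code.

```python
# Illustrative only: a 3x3 determinant check similar in spirit to the
# assertion in the log. This is not Rosetta's code.
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_proper_rotation(m, tol=1e-6):
    """True when det(m) is within `tol` of 1, as the failed check requires."""
    return abs(det3(m) - 1.0) < tol

angle = math.radians(30)
rot_z = [
    [math.cos(angle), -math.sin(angle), 0.0],
    [math.sin(angle),  math.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
]
mirror = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]   # det = -1
skewed = [[1.0, 0.0, 0.0], [0.0, 1.001, 0.0], [0.0, 0.0, 1.0]]  # det = 1.001

print(is_proper_rotation(rot_z), is_proper_rotation(mirror), is_proper_rotation(skewed))
```

A true rotation passes, while a reflection or a slightly scaled matrix does not, so the failure suggests the matrix was no longer a clean rotation at that point in the run.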

-------

Another test_needle* task, 449325408, ran for an excessive length of time (>7 hours on a 3 hour preference), generating 2 decoys. The result was valid, but there's a message in the log about an H-bond being tripped.

Hbond tripped: [2011- 9-20 10:24:55:]
BOINC:: CPU time: 25291.3s, 14400s + 10800s[2011- 9-20 15: 1:52:] :: BOINC
InternalDecoyCount: 2
======================================================
DONE :: 2 starting structures 25291.3 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================
called boinc_finish
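
A quick check of the numbers in the "CPU time" line. The assumption here is that "14400s + 10800s" means a 4 hour watchdog allowance added on top of the 3 hour (10800 s) run time preference; the log does not spell that out. All figures are copied from the log above.

```python
# Assumption: "14400s + 10800s" is a 4-hour watchdog allowance added to the
# 3-hour (10800 s) run time preference. Figures are copied from the log above.
cpu_time_s = 25291.3
watchdog_allowance_s = 14400
runtime_pref_s = 10800

limit_s = watchdog_allowance_s + runtime_pref_s
print(f"limit: {limit_s} s = {limit_s / 3600:.2f} h")
print(f"used:  {cpu_time_s} s = {cpu_time_s / 3600:.2f} h")
print(f"over limit by {cpu_time_s - limit_s:.1f} s")
```

If that reading is right, the task was stopped shortly after passing the combined limit rather than finishing on its own.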

alpha Profile

Joined: Nov 4 06
Posts: 27
ID: 127202
Credit: 953,255
RAC: 781
Message 71342 - Posted 27 Sep 2011 13:16:04 UTC

Compute error for work unit 408829805:

http://boinc.bakerlab.org/rosetta/result.php?resultid=450197233

The only problem I see is:

upload failure: <file_xfer_error>
<file_name>1AI8.ppk1.nobb_docking_benchmark_8Sep2011_30843_72_1_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>
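
For anyone collecting these reports, here is a small sketch that pulls the file name and error code out of such a fragment. The function name is made up for illustration. As far as I recall, -131 is BOINC's "file too big" code (ERR_FILE_TOO_BIG), but treat that as an assumption and check error_numbers.h in the BOINC source.

```python
import re

# Text copied from the task report above.
report = """upload failure: <file_xfer_error>
<file_name>1AI8.ppk1.nobb_docking_benchmark_8Sep2011_30843_72_1_0</file_name>
<error_code>-131</error_code>
</file_xfer_error>"""

def parse_xfer_errors(text):
    """Return (file_name, error_code) pairs found in a task report."""
    pattern = re.compile(
        r"<file_xfer_error>\s*"
        r"<file_name>(.*?)</file_name>\s*"
        r"<error_code>(-?\d+)</error_code>",
        re.DOTALL,
    )
    return [(name, int(code)) for name, code in pattern.findall(text)]

errors = parse_xfer_errors(report)
print(errors)
```

A regular expression is used instead of an XML parser because the report is a fragment with plain text in front of the first tag.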
____________

[AF>france>pas-de-calais]symaski62

Joined: Sep 19 05
Posts: 47
ID: 506
Credit: 33,871
RAC: 0
Message 71360 - Posted 1 Oct 2011 23:26:53 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=452608726

Task ID 452608726
Name Aug20_needle_13start_test_SAVE_ALL_OUT__31431_61348_0


<core_client_version>6.12.33</core_client_version>
<![CDATA[
<stderr_txt>
[2011-10- 1 0:22:16:] :: BOINC:: Initializing ... ok.
[2011-10- 1 0:22:16:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev42272.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/Aug20_13start_needle.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 86400
[2011-10- 1 10:18:19:] :: BOINC:: Initializing ... ok.
[2011-10- 1 10:18:19:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev42272.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/Aug20_13start_needle.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 86400
Continuing computation from checkpoint: chk_S_00008_FragmentSampler__stage1 ... success!

ERROR: std::abs( coordsys_rot.det() - 1.0 ) < 1e-6
ERROR:: Exit from: ..\..\..\src\core\pose\symmetry\util.cc line: 740
called boinc_finish

</stderr_txt>
]]>

____________

entigy

Joined: Nov 2 05
Posts: 3
ID: 8517
Credit: 191,352
RAC: 385
Message 71362 - Posted 2 Oct 2011 16:01:00 UTC

I've just reconnected to Rosetta after some time away, and the 2 units I've completed both have a 'validation error'.
Is this going to happen with all the remaining Mini 3.14 units I have ?
If so, I might as well detach again ......

robertmiles Profile

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 71363 - Posted 2 Oct 2011 21:11:07 UTC - in response to Message ID 71362.

I've just reconnected to Rosetta after some time away, and the 2 units I've completed both have a 'validation error'.
Is this going to happen with all the remaining Mini 3.14 units I have ?
If so, I might as well detach again ......


My three computers have already been on No New Tasks for Rosetta for weeks, but due to a different 3.14 problem. On some computers, including those, 3.14 workunits tend to crash in a way that does not manage to tell BOINC that the workunit is no longer running and some other workunit can now be started.

I'm getting better 3.14 results on RALPH@Home, though, so the developers may be working out a way to change the workunit inputs in a way that gives better results without changing the 3.14 program yet.

Therefore, I'd suggest setting Rosetta on No New Tasks for now, but letting the remaining workunits run to see if they will all at least finish properly.

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 71390 - Posted 7 Oct 2011 2:27:27 UTC

Task Aug20_needle_9start_test_SAVE_ALL_OUT__31432_91316_0 (452661954) failed on W7 after taking 7 hours on a 3 hour preference.

Watchdog active.
Hbond tripped: [2011-10- 5 14: 9:21:]
BOINC:: CPU time: 25478.3s, 14400s + 10800s[2011-10- 5 20:46:31:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)

Federico Fuga

Joined: Mar 4 09
Posts: 2
ID: 304405
Credit: 73,830
RAC: 0
Message 71399 - Posted 11 Oct 2011 10:18:57 UTC

Hi,

I have a Rosetta Mini 3.14 job that has been stuck at 8.595% for hours. The graphics application crashes. How can I check this issue? The IDs are: 455054413 415294178

Thank you

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 71400 - Posted 11 Oct 2011 15:18:36 UTC - in response to Message ID 71399.

Hi,

I have a Rosetta Mini 3.14 job that has been stuck at 8.595% for hours. The graphics application crashes. How can I check this issue? The IDs are: 455054413 415294178

Thank you


See the thread titled "Rosetta stops crunching". It might be helpful.

Speedy

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 521,019
RAC: 10
Message 71402 - Posted 11 Oct 2011 19:42:59 UTC - in response to Message ID 71400.



See the thread titled "Rosetta stops crunching". It might be helpful.

Made Rosetta stops crunching into a clickable link for you
____________
Have a crunching good day!!

Federico Fuga

Joined: Mar 4 09
Posts: 2
ID: 304405
Credit: 73,830
RAC: 0
Message 71403 - Posted 12 Oct 2011 9:31:35 UTC

Thank you, restarting BOINC did the job. The Rosetta jobs have now resumed.
Thank you


Copyright © 2017 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC