Minirosetta 3.14

Message boards : Number crunching : Minirosetta 3.14

robertmiles
Joined: 16 Jun 08
Posts: 759
Credit: 9,986,485
RAC: 3,436
Message 70598 - Posted: 20 Jun 2011, 2:41:54 UTC

I just had one of the hung workunits, on a computer that's usually much more reliable at completing Rosetta@home workunits.

casd_sgr145_boinc_3duwA_208.nonlocal.pctid_0.09.tmscore_0.63331._nonlocal_tex_IGNORE THE REST_27533_3268

I've selected 12 hour workunit lengths.

Currently at 20:04:42 elapsed, 5.770% progress and not increasing, 53:36:27 to completion.

CPU time at last checkpoint 00:43:47
CPU time 00:43:51

Commit size 352,408 KB

BOINC lists it as running, but it's not using any CPU time.

The workunits on the other CPU cores are from other BOINC projects, and running just fine.

Appears to be one of the several workunits I've seen where the hang occurred just after a checkpoint.
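The symptom reported here (elapsed time climbing while CPU time stays within seconds of the last checkpoint) comes down to simple arithmetic on the three times shown. Below is a minimal Python sketch of that check. The function names and both thresholds are hypothetical choices for illustration and are not part of BOINC; a large idle gap on its own can also come from a suspended or preempted task.

```python
def hms_to_seconds(text: str) -> int:
    """Convert an 'HH:MM:SS' string (as displayed for a task) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def stall_report(elapsed: str, cpu: str, checkpoint_cpu: str,
                 min_idle_s: int = 3600,
                 max_after_checkpoint_s: int = 60) -> dict:
    """Summarize how far wall-clock time has run ahead of CPU time.

    min_idle_s and max_after_checkpoint_s are assumed thresholds, chosen
    only for this illustration.
    """
    idle_s = hms_to_seconds(elapsed) - hms_to_seconds(cpu)
    after_checkpoint_s = hms_to_seconds(cpu) - hms_to_seconds(checkpoint_cpu)
    return {
        "idle_s": idle_s,
        "cpu_since_checkpoint_s": after_checkpoint_s,
        # Flag only when both conditions hold: long idle gap, and almost
        # no CPU time accumulated since the last checkpoint.
        "suspect": idle_s >= min_idle_s
                   and after_checkpoint_s <= max_after_checkpoint_s,
    }


# Figures taken from the report above.
report = stall_report("20:04:42", "00:43:51", "00:43:47")
```

With the figures from this report, the idle gap is 69,651 seconds (about 19.3 hours) and only 4 CPU seconds were used after the last checkpoint, so the task is flagged as suspect.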

BOINC appears likely to be in a state where it asks for GPU workunits only, and only from BOINC projects where I've never seen any available. I've seen this condition fairly often on my laptop before, but not on my desktop where it is now.

Clicking on the Show Graphics button brings up the minirosetta_graphics_3.13_windows_x86_64.exe program (previously not running) showing a window with a proper frame and a proper label at the top, but with the space inside the frame totally black. Clicking on the X in the red space at the top right corner of the graphics window gives this error message:

minirosetta_graphics_3.13_windows_x86_64.exe is not responding

Problem Event Name: AppHangB1
Hang Signature: dd05
Hang Type: 0
(There are too many more detail lines to copy by hand, because the error message disappears every time I switch to a window to copy them into, or start Snipping Tool.)

Should I abort this workunit so you can see the output files?

Or just restart BOINC to see if that will restart that workunit properly?

Or something else?


6/18/2011 5:48:13 PM Starting BOINC client version 6.10.58 for windows_x86_64
6/18/2011 5:48:13 PM log flags: file_xfer, sched_ops, task
6/18/2011 5:48:13 PM Libraries: libcurl/7.19.7 OpenSSL/0.9.8l zlib/1.2.3
6/18/2011 5:48:13 PM Data directory: C:\ProgramData\BOINC
6/18/2011 5:48:13 PM Running under account Bobby
6/18/2011 5:48:13 PM Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]
6/18/2011 5:48:13 PM Processor: 6.00 MB cache
6/18/2011 5:48:13 PM Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe
6/18/2011 5:48:13 PM OS: Microsoft Windows Vista: Home Premium x64 Edition, Service Pack 2, (06.00.6002.00)
6/18/2011 5:48:13 PM Memory: 8.00 GB physical, 15.66 GB virtual
6/18/2011 5:48:13 PM Disk: 919.67 GB total, 544.50 GB free
6/18/2011 5:48:13 PM Local time is UTC -5 hours
6/18/2011 5:48:13 PM NVIDIA GPU 0: GeForce GTS 450 (driver version 26724, CUDA version 3020, compute capability 2.1, 993MB, 476 GFLOPS peak)

6/18/2011 5:48:13 PM General prefs: using separate prefs for work
6/18/2011 5:48:13 PM Reading preferences override file
6/18/2011 5:48:13 PM Preferences:
6/18/2011 5:48:13 PM max memory usage when active: 3276.16MB
6/18/2011 5:48:13 PM max memory usage when idle: 3276.16MB
6/18/2011 5:48:13 PM max disk usage: 30.00GB
6/18/2011 5:48:13 PM max CPUs used: 3
6/18/2011 5:48:13 PM (to change preferences, visit the web site of an attached project, or select Preferences in the Manager)
6/18/2011 5:48:13 PM Not using a proxy

robertmiles
Joined: 16 Jun 08
Posts: 759
Credit: 9,986,485
RAC: 3,436
Message 70599 - Posted: 20 Jun 2011, 2:48:29 UTC - in response to Message 70597.  
Last modified: 20 Jun 2011, 3:12:11 UTC

Have you thought of creating a test application specifically to gather more information on the computer environment it is running on, then sending one such workunit to each machine known to have a problem with workunits freezing? No objection if it then goes on to attempt to run a normal workunit afterwards, possibly with more debugging output than usual enabled.

You'd probably also want to send such workunits to a variety of other computers, to gather outputs for comparison to those from the problem computers.

For example, if it is able to capture the line from the BOINC manager log file describing the CPU capabilities, and perhaps those describing GPU capabilities, just those lines should offer a good starting point in deciding what to look for, if it happens to be something related to matching up properly to the CPU type.
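As a rough illustration of the idea above, the CPU lines from a BOINC startup log can be reduced to a small record that is easy to compare across machines. This is a sketch only: the regular expressions assume the lines look exactly like the ones quoted earlier in this thread, and `summarize_cpu` is a hypothetical helper, not an existing BOINC or Rosetta tool.

```python
import re

# Matches e.g. "Processor: 4 GenuineIntel Intel(R) ... [Family 6 Model 23 Stepping 10]"
PROCESSOR_RE = re.compile(
    r"Processor:\s+(\d+)\s+(\S+)\s+(.+?)\s+"
    r"\[Family (\d+) Model (\d+) Stepping (\d+)\]"
)
# Matches e.g. "Processor features: fpu vme de ..."
FEATURES_RE = re.compile(r"Processor features:\s+(.+)$")


def summarize_cpu(log_lines):
    """Extract CPU count, vendor, model and feature flags from log lines."""
    summary = {}
    for line in log_lines:
        line = line.strip()
        match = PROCESSOR_RE.search(line)
        if match:
            summary.update(
                cores=int(match.group(1)),
                vendor=match.group(2),
                model=match.group(3),
                family=int(match.group(4)),
                model_number=int(match.group(5)),
                stepping=int(match.group(6)),
            )
            continue
        match = FEATURES_RE.search(line)
        if match:
            summary["features"] = set(match.group(1).split())
    return summary


# Sample lines copied from the log posted earlier in this thread.
sample = [
    "6/18/2011 5:48:13 PM Processor: 4 GenuineIntel Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz [Family 6 Model 23 Stepping 10]",
    "6/18/2011 5:48:13 PM Processor: 6.00 MB cache",
    "6/18/2011 5:48:13 PM Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 cx16 sse4_1 syscall nx lm vmx smx tm2 pbe",
]
cpu = summarize_cpu(sample)
```

Collecting one such record per machine, from both problem and non-problem hosts, would make it straightforward to look for a vendor, family, or feature flag that the failing machines share.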


I don't see anything about the specs of the machines that would give a direct indication. It happened the most recently on 3 different ones:
- P4 2.8GHz (single core), 2GB RAM, just sitting idle most of the time as it is my Windows 2008 test server
- Vaio notebook, also sitting idle most of the time as the built-in keyboard is reluctant to work and I need to use an external one, Pentium M760 2GHz, 2GB RAM
- my main work computer, Core 2 Duo 6300@1.866GHz, 4GB RAM (3GB avail under XPSP3)

The last one has the most "freezes", but always plenty of CPU and RAM to spare (average 1GB physical RAM free).

Ralf


Combining your system types and mine suggests that it might be worthwhile checking if it is specific to Intel CPUs, and even perhaps some ranges of Intel CPU types.

As for what's next on Ralph@Home, it currently has no workunits queued, so you'll have to wait for some to become available.


Since I currently have an Einstein@Home CPU workunit that seems rather reluctant to finish in a reasonable time (perhaps due to the debt from the several Einstein@Home GPU workunits run recently), I've decided to drain the queue of CPU workunits on my desktop by temporarily setting all BOINC projects offering CPU workunits to No New Tasks.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1016
Credit: 3,981,630
RAC: 202
Message 70600 - Posted: 20 Jun 2011, 3:17:46 UTC

If you have stalled work units, please go ahead and abort the jobs, and manually kill the minirosetta process from the task manager if necessary. We'll post an update on Ralph sometime early next week and submit more test work units. Please post the names of the work units so we know which ones to test on Ralph. I'd also recommend suspending R@h for the time being.

alpha
Joined: 4 Nov 06
Posts: 27
Credit: 1,545,892
RAC: 458
Message 70603 - Posted: 20 Jun 2011, 13:54:35 UTC

Computation error: http://boinc.bakerlab.org/rosetta/result.php?resultid=430150155

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x750F9617

Holmis
Joined: 15 Nov 07
Posts: 6
Credit: 975,490
RAC: 0
Message 70604 - Posted: 20 Jun 2011, 15:36:54 UTC

I've also got a computation error on this task.

<message>
Felaktig funktion. (0x1) - exit code 1 (0x1)
</message>

and

ERROR: Cannot open PDB file "2ilaA.pdb"
ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Translation:
Felaktig funktion = Incorrect function

robertmiles
Joined: 16 Jun 08
Posts: 759
Credit: 9,986,485
RAC: 3,436
Message 70605 - Posted: 20 Jun 2011, 16:54:58 UTC

One more where BOINC thinks it's running, but it's using no CPU time at all:

ilv_hr41_all_boinc_2ebmA_108.nonlocal.pctid_0.14.tmscore_0.45048._nonlocal_tex_IGNORE_THE_REST_27535_3351

12 hour workunits requested.

13:38:38 elapsed, 62.456% progress, 07:46:40 To completion

CPU time at last checkpoint 07:31:42
CPU time 07:31:46

Appears to be one more of the many I've seen that stopped using any CPU time shortly after a checkpoint, or possibly after the checkpoint was started but not yet finished.

Rosetta@Home already on No new tasks while I drain the list of CPU workunits to force one especially slowly running workunit to get enough CPU time to finish. No more already on that computer.

Computer environment already described above.

Alan J Rodger
Joined: 16 Oct 05
Posts: 7
Credit: 32,282
RAC: 0
Message 70607 - Posted: 20 Jun 2011, 17:29:49 UTC

I've had to abort several work units because time elapsed and time to completion both go up without % completion changing - one work unit reached 25 hours and went from ca 3 hours to completion at the start to 18 hours to completion. Many are minirosetta 3.14. What's up?

Alan

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1016
Credit: 3,981,630
RAC: 202
Message 70608 - Posted: 20 Jun 2011, 17:53:41 UTC

There are reports of minirosetta 3.14 continuing on with 0 cpu usage. This sounds like the same issue. You'll have to manually kill the process using the task manager and suspend the R@h project for the time being. We are looking into this issue.

.clair.
Joined: 2 Jan 07
Posts: 45
Credit: 14,619,670
RAC: 13,258
Message 70610 - Posted: 20 Jun 2011, 23:50:45 UTC

Compute error after 2.1 seconds, wingman had the same

ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_2350_1

ERROR: Cannot open PDB file "2ilaA.pdb"
ERROR:: Exit from: src/core/import_pose/import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


workunit - http://boinc.bakerlab.org/rosetta/workunit.php?wuid=392925238

N7QLT
Joined: 19 Dec 05
Posts: 2
Credit: 1,965,577
RAC: 176
Message 70613 - Posted: 21 Jun 2011, 15:28:48 UTC

I have noticed occasions when workunits are running but not using any CPU. If I exit BOINC and request it to stop science projects, wait a moment, and then restart BOINC, it seems to reset "something" and the Rosetta projects start running again. I am guessing that it is related to some other process on the system, maybe nightly virus scans or other maintenance. Maybe it's related to the workunit in progress. I just haven't had time to research it further.

Holmis
Joined: 15 Nov 07
Posts: 6
Credit: 975,490
RAC: 0
Message 70614 - Posted: 21 Jun 2011, 19:06:12 UTC

Got one more compute error today, it's the same error as I posted in message #70604 earlier in this thread.

ERROR: Cannot open PDB file "2ilaA.pdb"
ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

It also appears to be the same type of task:
ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_4527_0

and

ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_6682_0

Link to new error: http://boinc.bakerlab.org/rosetta/result.php?resultid=431062430


Something wrong with them?
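To see whether failures cluster on one batch, the reported workunit names can be grouped by their leading fields. The sketch below is illustrative only: it assumes the naming layout seen in this thread (a batch label ending in `_boinc`, followed by a target identifier such as `2ilaA`), which is inferred from the names posted here rather than from any documented Rosetta@home convention, and `batch_and_target` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed layout: <batch ending in _boinc>_<target>_<rest of name>
NAME_RE = re.compile(r"^(?P<batch>.+?_boinc)_(?P<target>[^_]+)_")


def batch_and_target(workunit_name):
    """Return (batch, target) for a workunit name, or None if it does not fit."""
    match = NAME_RE.match(workunit_name)
    if match is None:
        return None
    return match.group("batch"), match.group("target")


# Names of failed workunits as reported in this thread.
failed = [
    "ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_2350_1",
    "ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_4527_0",
    "ilv_fgf2_all_boinc_2ilaA_139.nonlocal.pctid_0.12.tmscore_._nonlocal_tex_IGNORE_THE_REST_27534_6682_0",
]
counts = Counter(batch_and_target(name) for name in failed)
```

All three names above fall into the same group, `("ilv_fgf2_all_boinc", "2ilaA")`, which matches the target named in the "Cannot open PDB file" error.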

.clair.
Joined: 2 Jan 07
Posts: 45
Credit: 14,619,670
RAC: 13,258
Message 70616 - Posted: 21 Jun 2011, 21:26:54 UTC

It looks like the ilv_fgf2_all_boinc units have a problem: another one failed after 2.1 seconds with the same error as before.

ERROR: Cannot open PDB file "2ilaA.pdb"
ERROR:: Exit from: src/core/import_pose/import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=393122110

James Thompson
Joined: 13 Oct 05
Posts: 46
Credit: 186,109
RAC: 0
Message 70623 - Posted: 22 Jun 2011, 16:36:30 UTC - in response to Message 70616.  

It looks like the ilv_fgf2_all_boinc units have a problem: another one failed after 2.1 seconds with the same error as before.

ERROR: Cannot open PDB file "2ilaA.pdb"
ERROR:: Exit from: src/core/import_pose/import_pose.cc line: 199
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


http://boinc.bakerlab.org/rosetta/workunit.php?wuid=393122110



Something is wrong with those workunits. I'll remove them now.



P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 70625 - Posted: 22 Jun 2011, 23:22:45 UTC
Last modified: 22 Jun 2011, 23:24:43 UTC

This is not funny. Why is this still happening?

This task should have finished after 49 models. I believe it should have stopped, but for some reason it started again and did one more model, and that's all I'm getting credit for after 4+ hours.

# cpu_run_time_pref: 14400
======================================================
DONE :: 49 starting structures 14250.2 cpu seconds
This process generated 49 decoys from 49 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 14501.5 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

Valid
Claimed credit__119.52
Granted credit__1.81

application version 3.14
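The two "DONE" summaries above can be tallied mechanically to show how much work the task did in total versus what its final run reported. This is a minimal sketch: the regular expression assumes the stderr line looks like the ones quoted in this post, `parse_runs` is a hypothetical helper, and nothing here claims to describe how the server actually computes granted credit.

```python
import re

# Matches e.g. "DONE :: 49 starting structures 14250.2 cpu seconds"
DONE_RE = re.compile(
    r"DONE\s+::\s+(\d+)\s+starting structures\s+([\d.]+)\s+cpu seconds"
)


def parse_runs(stderr_text):
    """Return a list of (structures, cpu_seconds), one entry per DONE line."""
    return [(int(count), float(seconds))
            for count, seconds in DONE_RE.findall(stderr_text)]


# The two summary lines quoted in this post.
stderr_text = """
DONE :: 49 starting structures 14250.2 cpu seconds
This process generated 49 decoys from 49 attempts
DONE :: 1 starting structures 14501.5 cpu seconds
This process generated 1 decoys from 1 attempts
"""
runs = parse_runs(stderr_text)
total_structures = sum(count for count, _ in runs)
last_run_structures = runs[-1][0]
```

Here the task completed 50 structures across both runs, but the final summary reports only 1, which is consistent with the poster's impression that only the last model was credited.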

Alan J Rodger
Joined: 16 Oct 05
Posts: 7
Credit: 32,282
RAC: 0
Message 70632 - Posted: 24 Jun 2011, 15:11:07 UTC

Why don't you stop sending out Minirosetta 3.14 work units until you solve the problem? The other units seem to work.

robertmiles
Joined: 16 Jun 08
Posts: 759
Credit: 9,986,485
RAC: 3,436
Message 70633 - Posted: 24 Jun 2011, 16:50:01 UTC - in response to Message 70632.  

Why don't you stop sending out Minirosetta 3.14 work units until you solve the problem? The other units seem to work.


Possibly because the Minirosetta 3.14 workunits work on some computers. For example, my laptop. Possibly because they want to gather more information on what kinds of computers those workunits don't work on.

svincent
Joined: 30 Dec 05
Posts: 212
Credit: 7,933,755
RAC: 4,233
Message 70634 - Posted: 24 Jun 2011, 18:44:58 UTC

Task 430561413 failed with an Out of Memory message


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x758B9617

Engaging BOINC Windows Runtime Debugger...

(much debugging stuff snipped)

Odd, considering it's a C2D with 4GB of memory and these tasks don't seem to use more than 300-400MB.

I've also been having a lot of these 'task hanging' issues on W7 (not Mac) : they're curable by quitting BOINC and restarting. Haven't been keeping track of all the names but tasks with names like ilv* seem particularly prone to this behaviour.

[AF>france>pas-de-calais]symaski62
Joined: 19 Sep 05
Posts: 47
Credit: 33,871
RAC: 0
Message 70635 - Posted: 24 Jun 2011, 23:11:28 UTC

25/06/2011 01:02:23 | rosetta@home | Started download of minirosetta_3.14_windows_intelx86.exe
25/06/2011 01:02:23 | rosetta@home | Started download of minirosetta_graphics_3.13_windows_intelx86.exe

3.14 version ?




robertmiles
Joined: 16 Jun 08
Posts: 759
Credit: 9,986,485
RAC: 3,436
Message 70641 - Posted: 26 Jun 2011, 2:48:54 UTC - in response to Message 70635.  
Last modified: 26 Jun 2011, 2:49:25 UTC

25/06/2011 01:02:23 | rosetta@home | Started download of minirosetta_3.14_windows_intelx86.exe
25/06/2011 01:02:23 | rosetta@home | Started download of minirosetta_graphics_3.13_windows_intelx86.exe

3.14 version ?




Yes, that is the 3.14 version of the main application program. The screensaver graphics program looks to be a version behind (3.13), though.

Greg_BE
Joined: 30 May 06
Posts: 4871
Credit: 3,948,842
RAC: 2,294
Message 70643 - Posted: 26 Jun 2011, 5:59:44 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=430776403
casd_sgr145_boinc_3lccA_26.nonlocal.pctid_0.20.tmscore_0.67557._nonlocal_tex_IGNORE_THE_REST_27533_2536_1

Outcome Client error
Client state Compute error
Exit status -529697949 (0xe06d7363)
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7C812AFB

This task chewed up 3.24GB of RAM?
That's insane.
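The negative exit status and the hexadecimal code in this report are the same 32-bit value read two ways, which the short Python sketch below works through. The helper name `to_signed32` is hypothetical. As background, 0xE06D7363 is the code the Microsoft Visual C++ runtime uses when raising any C++ exception (its low three bytes spell "msc"), so the code alone may not prove that the task actually exhausted memory.

```python
def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


code = 0xE06D7363

# Signed view, as shown in the "Exit status" line.
exit_status = to_signed32(code)

# The low three bytes are ASCII characters.
tag = (code & 0xFFFFFF).to_bytes(3, "big").decode("ascii")
```

Evaluating this gives an exit status of -529697949 and the tag "msc", matching the values in the result page quoted above.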



©2020 University of Washington
http://www.bakerlab.org