minirosetta 2.17

Message boards : Number crunching : minirosetta 2.17

TimL
Joined: 16 Sep 06
Posts: 17
Credit: 15,509,973
RAC: 2
Message 70217 - Posted: 2 May 2011, 10:24:30 UTC

FOLD_N_DOCK_YgaP_D2symm_2_SAVE_ALL_OUT_IGNORE_THE_REST_w_csts_26019_4307_0 ran for over 17 hours (Target run time = 8 hours) then failed with this error.

robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 70221 - Posted: 2 May 2011, 15:36:37 UTC - in response to Message 70210.  
Last modified: 2 May 2011, 16:07:29 UTC

Well guess we will have to wait for the Grad student to wake up


The "no heartbeat" message means the science app and the BOINC client lost contact with each other. When the science application doesn't receive the heartbeat (the "I'm alive" message) from BOINC, it is supposed to exit. As long as it was merely a temporary obstruction and BOINC hasn't actually crashed, BOINC should see that the application has stopped, restart it, and proceed merrily on its way. Only when this happens repeatedly with a single task (100 times) does BOINC give up, sending that task back and starting a brand new task. If I'm reading correctly, the "no heartbeat" messages occurred after you had restarted BOINC, and Rosetta was able to successfully complete the task despite them. They may or may not be related to the cause of the error Gregg highlighted, which may have led to a BOINC crash that it couldn't recover from without a restart; thus the long delay until you noticed, restarted, and set BOINC and Rosetta on their merry way again.
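
The exit, restart, and give-up behaviour described above can be illustrated with a small sketch. This is not BOINC's actual code or API: `Task`, `app_should_exit`, and `client_handle_exit` are hypothetical names, and the 30-second timeout is an assumed value chosen for illustration. Only the limit of 100 restarts is taken from the post.

```python
from dataclasses import dataclass

HEARTBEAT_TIMEOUT_S = 30.0  # assumed value, for illustration only
MAX_RESTARTS = 100          # the "100 times" figure mentioned in the post


@dataclass
class Task:
    """Hypothetical stand-in for one task tracked by the client."""
    name: str
    restarts: int = 0
    abandoned: bool = False


def app_should_exit(now_s, last_heartbeat_s, timeout_s=HEARTBEAT_TIMEOUT_S):
    """Science-app side: exit once no heartbeat has arrived within the timeout."""
    return (now_s - last_heartbeat_s) > timeout_s


def client_handle_exit(task, max_restarts=MAX_RESTARTS):
    """Client side: restart the app, or give up on the task after too many exits."""
    task.restarts += 1
    if task.restarts >= max_restarts:
        task.abandoned = True
        return "abandon"
    return "restart"
```

In this model a short stall costs one restart, and a task is only sent back when the same task keeps losing the heartbeat.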

You might try to recall what else was running on your computer at the time of the "no heartbeat" messages (22:6:36, 22:7:11, 22:8:47, 22:17:41). Anti-virus, anti-spyware, some other maintenance type scan, indexing? Could be something you started deliberately or could be something running automatically in the background. I don't suppose you started some new process (indexing, say) between 2:38:23 and the time BOINC stopped (which, if BOINC hadn't been running for 13.5 hours when you restarted must have been about 8. Is that right?). That could point to the cause of the crash and, if the process was ongoing (or maybe set to check for changes, like an index or a backup), could also explain the "no heartbeat" messages.
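
One way to look for a scheduled background job is to compute the gaps between the "no heartbeat" timestamps and compare them with the schedule of any scans or indexers. The sketch below is only an illustration: the helper names `to_seconds` and `gaps` are made up here, and it assumes all four timestamps fall on the same day.

```python
def to_seconds(stamp):
    """Convert an 'H:M:S' stamp (fields need not be zero-padded) to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def gaps(stamps):
    """Return the number of seconds between each pair of consecutive timestamps."""
    secs = [to_seconds(s) for s in stamps]
    return [later - earlier for earlier, later in zip(secs, secs[1:])]


# Timestamps of the "no heartbeat" messages quoted in the post.
no_heartbeat = ["22:6:36", "22:7:11", "22:8:47", "22:17:41"]
print(gaps(no_heartbeat))  # [35, 96, 534]
```

The gaps are irregular here, which by itself neither confirms nor rules out a background scan.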


Best,
Snags


When I've seen similar error messages, the Norton Internet Security antivirus program was always running in the background (no good way to shut it off other than uninstalling it). Not sure if that's also why I often see the BOINC Manager program lose contact with the rest of BOINC. Do the other people seeing this problem also use Norton Internet Security?

Ian_D
Joined: 21 Sep 05
Posts: 55
Credit: 4,216,173
RAC: 0
Message 70227 - Posted: 2 May 2011, 22:01:05 UTC - in response to Message 70221.  


When I've seen similar error messages, the Norton Internet Security antivirus program was always running in the background (no good way to shut it off other than uninstalling it). Not sure if that's also why I often see the BOINC Manager program lose contact with the rest of BOINC. Do the other people seeing this problem also use Norton Internet Security?


Nope, will NOT have Norton on any of my machines.


Jesse Viviano
Joined: 14 Jan 10
Posts: 42
Credit: 2,700,472
RAC: 0
Message 70254 - Posted: 5 May 2011, 23:03:56 UTC

I just got a validate error on work unit 420544516. Could someone please investigate why the validator failed here?

Jesse Viviano
Joined: 14 Jan 10
Posts: 42
Credit: 2,700,472
RAC: 0
Message 70255 - Posted: 6 May 2011, 3:42:10 UTC - in response to Message 70254.  

I just got a validate error on work unit 420544516. Could someone please investigate why the validator failed here?

Oops! That should be result 420544516. The corresponding work unit number is 383771914.

Speedy
Joined: 25 Sep 05
Posts: 163
Credit: 808,337
RAC: 1
Message 70256 - Posted: 6 May 2011, 5:21:06 UTC

420656625 (FOLD_N_DOCK_dagk_D2symm) got validate state Invalid after CPU time 2010.416; the run time was meant to be 3 hours. The corresponding work unit number 420591203 also got validate state Invalid, after CPU time 3843.709 (has debug message).
Have a crunching good day!!

SafeAggie
Joined: 22 Oct 05
Posts: 3
Credit: 458,414
RAC: 0
Message 70272 - Posted: 7 May 2011, 18:42:48 UTC

Validate Error: ProteinG_abinitio_SAVE_ALL_OUT_design_relax_g056_009_26017_78
wuid=382515464
resultid=419989702


Validate Error: ProteinG_abinitio_SAVE_ALL_OUT_design_relax_g056_010_26017_78
wuid=382515501
resultid=419989703

.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 70276 - Posted: 7 May 2011, 20:59:48 UTC
Last modified: 7 May 2011, 21:06:00 UTC

Validate error -ProteinG_abinitio_SAVE_ALL_OUT_design_relax_g061_005_26530_180_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=420705463

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 70280 - Posted: 8 May 2011, 7:09:51 UTC

Error Message: - Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7C812AFB
Wingman also had the same problem with a little longer run time.


Tasks:

FOLD_N_DOCK_2kqt_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26674_9746_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=421054105

FOLD_N_DOCK_2kqt_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26674_1528_0
https://boinc.bakerlab.org/rosetta/result.php?resultid=420870386

FOLD_N_DOCK_dagk_D2symm_SAVE_ALL_OUT_IGNORE_THE_REST_26520_9259_1
https://boinc.bakerlab.org/rosetta/result.php?resultid=420803687

.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 70284 - Posted: 8 May 2011, 16:00:24 UTC

Validate error - ProteinG_abinitio_SAVE_ALL_OUT_design_relax_g049_008_26508_177

Both of us that crunched this unit got this error

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=383720482

robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 1,227
Message 70285 - Posted: 8 May 2011, 16:43:08 UTC
Last modified: 8 May 2011, 16:47:27 UTC

Another workunit that appeared to stop using any CPU time at all shortly after a checkpoint, but BOINC thought it was still running for about 2 more days elapsed:

pred_ECH19_lr19a_189_0003_nh.pdb_26473_588_0

However, it eventually decided that it had gone past a time limit and engaged the BOINC debugger. Could there be a problem with the BOINC debugger announcing that it has finished and that the workunit should be marked as ended?

Also, the listing of my results does not appear to contain any information on which version of minirosetta was used. 2.17 is the latest version, so I'm assuming that one.

Not sure if the Tthrottle add-on I'm using to prevent my computer from overheating has any effect on this problem.

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 70456 - Posted: 31 May 2011, 1:31:47 UTC

Task 426111314 ( lysozyme_var_quota_8_15_noH_SAVE_ALL_OUT_27153_445_0 ) failed immediately on Mac.

ERROR: ERROR: FragmentIO: could not open file q-noHom.frags.15mers.gz
ERROR:: Exit from: src/core/fragment/FragmentIO.cc line: 258
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


.clair.
Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 70491 - Posted: 2 Jun 2011, 20:35:39 UTC

Compute error after 3 seconds

lysozyme_var_dis_8_15_SAVE_ALL_OUT_27136_429_0

both of us got the same error

ERROR: ERROR: FragmentIO: could not open file cs-lys.15mers.gz
ERROR:: Exit from: src/core/fragment/FragmentIO.cc line: 258
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=388396144
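
For anyone who wants to check locally whether a fragment file such as the one named in the error is absent, unreadable, or damaged, a minimal sketch follows. It is not part of Rosetta or BOINC; `check_gz` is a hypothetical helper, and where the file should live on disk is not stated in this thread, so the path is left as a parameter.

```python
import gzip
import os
import zlib


def check_gz(path):
    """Return 'missing', 'unreadable', 'corrupt', or 'ok' for a gzip file."""
    if not os.path.exists(path):
        return "missing"
    if not os.access(path, os.R_OK):
        return "unreadable"
    try:
        # Decompress the whole file in chunks so truncation is detected too.
        with gzip.open(path, "rb") as handle:
            while handle.read(1 << 16):
                pass
    except (OSError, EOFError, zlib.error):
        return "corrupt"
    return "ok"
```

If the same task fails the same way on two different hosts, as reported here, the result would most likely be "missing", which points at the task's input files rather than at either machine.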

Dean Costello
Joined: 8 Feb 11
Posts: 4
Credit: 21,658,356
RAC: 2,497
Message 70520 - Posted: 8 Jun 2011, 22:55:42 UTC

Hello,
I hate to leave this message because it seems like a problem that has already been answered somewhere, but I can't find it.

Here's the thing: I get the following error on my new iMac running a new version of BOINC (6.12.26)

Wed Jun 8 18:35:23 2011 | rosetta@home | Sending scheduler request: Requested by user.
Wed Jun 8 18:35:23 2011 | rosetta@home | Reporting 27 completed tasks, requesting new tasks for CPU
Wed Jun 8 18:35:23 2011 | | [error] Can't create HTTP response output file sched_reply_boinc.bakerlab.org_rosetta.xml
Wed Jun 8 18:35:23 2011 | rosetta@home | Scheduler request initialization failed: fopen() failed

Any ideas on this?
-
Dean Costello

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 70523 - Posted: 9 Jun 2011, 11:36:27 UTC

From the sound of it, the network is fine and the reply is coming back... but the BOINC Core client is unable to create a new file to write a copy of the response to. So either disk is full, or permissions are not correct to do so.
Rosetta Moderator: Mod.Sense
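
Both possibilities (full disk, wrong permissions) can be tested directly. The following is a minimal sketch, not a BOINC tool: `diagnose_dir` is a hypothetical helper, and the directory to test (the BOINC data directory, whose location depends on the installation) has to be supplied by the reader. It only gives a meaningful answer when run as the same user account that the BOINC client runs under.

```python
import os
import shutil
import tempfile


def diagnose_dir(path):
    """Report free space and whether a new file can be created in `path`."""
    report = {
        "exists": os.path.isdir(path),
        "free_bytes": None,
        "can_create_file": False,
        "error": None,
    }
    if not report["exists"]:
        return report
    report["free_bytes"] = shutil.disk_usage(path).free
    try:
        # Create a throwaway file; it is deleted automatically on close.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        report["can_create_file"] = True
    except OSError as exc:
        report["error"] = str(exc)
    return report
```

A result with plenty of `free_bytes` but `can_create_file` set to `False` would point at permissions rather than disk space.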

Dean Costello
Joined: 8 Feb 11
Posts: 4
Credit: 21,658,356
RAC: 2,497
Message 70529 - Posted: 9 Jun 2011, 23:02:05 UTC

> From the sound of it, the network is fine and the reply is coming back... but the BOINC Core client is unable to create a new file to write a copy of the response to.

Thanks for getting back to me. For further information, the Seti and Climate Prediction projects seem to be uploading/downloading as they normally do.

> So either disk is full, or permissions are not correct to do so.

I have a dab less than a TB free, so I don't think that it's a disk error. So, how do I address the permission issue?

Thank you for your help.
-
Dean Costello

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 70530 - Posted: 9 Jun 2011, 23:50:44 UTC

Dean, I found this thread that describes a possible scenario... but if other projects are working, that doesn't sound like the problem. I've never heard of such a thing before. Since it doesn't seem related to minirosetta v2.17 (the topic of this thread), if the info in that link doesn't help, please open a new thread on the Number Crunching board to describe and discuss the issue further.
Rosetta Moderator: Mod.Sense

Dean Costello
Joined: 8 Feb 11
Posts: 4
Credit: 21,658,356
RAC: 2,497
Message 70535 - Posted: 10 Jun 2011, 21:26:13 UTC

Again, thanks for getting back in touch.

I took a look at the referenced thread, and it seems that it is more associated with the PC and system directories than anything else, so I don't think that it is applicable to my mighty iMac. I'll go ahead and open it up to the number crunching crowd and see what happens.

Appreciate the information.
-
Dean Costello

beenie210772
Joined: 20 Jan 06
Posts: 1
Credit: 6,777,038
RAC: 1,385
Message 70545 - Posted: 12 Jun 2011, 11:18:48 UTC

Not sure if this is a problem with the work units or a server problem, but I've got 6 units that are failing to report and update. All I keep getting in the log is:
Project communication failed: attempting access to reference site
Internet access OK - project servers may be temporarily down.
If it's of any use the offending units are as follows :-

M50_boinc_NH_restraints_abrelax_cs_frags_tex_IGNORE_THE_REST_26671_153491
M50_boinc_NH_restraints_abrelax_cs_frags_tex_IGNORE_THE_REST_26671_164636
M50_boinc_NH_restraints_abrelax_cs_frags_tex_IGNORE_THE_REST_26671_172575
heIF5_NTD_boinc_rosetta_cm_abrelax_cs_frags_hari_IGNORE_THE_REST_26734_168415
heIF5_NTD_boinc_rosetta_cm_abrelax_cs_frags_hari_IGNORE_THE_REST_26734_174250
heIF5_NTD_boinc_rosetta_cm_abrelax_cs_frags_hari_IGNORE_THE_REST_26734_180738
thanks in advance for any help

Rabinovitch
Joined: 28 Apr 07
Posts: 28
Credit: 5,439,728
RAC: 0
Message 70560 - Posted: 16 Jun 2011, 2:08:10 UTC

Could someone please explain why this application creates 4 (!) processes under Linux, each with the same memory consumption as the single process under Windows? Thus it needs 4 times as much RAM under Linux to process a WU as it does under Windows! That is very wasteful, and I would like to see this bug fixed (it can't be intended by design).
From Siberia with love!



©2024 University of Washington
https://www.bakerlab.org