Problems with Minirosetta v1.54

Gavin Shaw
Joined: 1 Feb 07
Posts: 10
Credit: 506,456
RAC: 0
Message 59228 - Posted: 1 Feb 2009, 23:07:15 UTC

Got this one a day or so ago. Not sure if it is a failure/error.

224812655

Never surrender and never give up. In the darkest hour there is always hope.

ID: 59228 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 59233 - Posted: 2 Feb 2009, 4:19:31 UTC

Hi Mike.

I did see somewhere that you said something about large upload file sizes. I think this is one that got away. ;)

99 models in 4 hr 26 min, with a result file of 8.32 MB.

_CAPRI17_T39_2_.sjf_br_one_docking.protocol__6483_19318_1.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=205403421

pete.

ID: 59233 · Rating: 0
Mod.Sense
Volunteer moderator
Project administrator
Joined: 22 Aug 06
Posts: 3545
Credit: 0
RAC: 0
Message 59236 - Posted: 2 Feb 2009, 5:30:33 UTC
Last modified: 2 Feb 2009, 5:35:16 UTC

Peter, the potential for large output files is why Mike changed it to exit after 99 models. That lets the task report back that it's running through models like candy, and then they can weigh that before releasing more similar tasks.
Rosetta Moderator: Mod.Sense
ID: 59236 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 59237 - Posted: 2 Feb 2009, 6:00:42 UTC - in response to Message 59236.  
Last modified: 2 Feb 2009, 6:02:21 UTC

Peter, the potential for large output files is why Mike changed it to exit after 99 models. That lets the task report back that it's running through models like candy, and then they can weigh that before releasing more similar tasks.


Hi.

Just as well it did finish after 99. I would hate to see the file size after 12 or 24 hours! :) I just returned another one the same size.

pete.
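For a rough sense of scale, the figures quoted above (99 models, 4 hr 26 min, 8.32 MB) can be extrapolated to the 12 and 24 hour runtimes mentioned in this post. This is only a sketch: it assumes the task keeps producing models at the same rate and that every model adds the same amount to the result file, neither of which the thread confirms. The function name is made up for illustration.

```python
# Figures quoted in the thread: 99 models, 4 hr 26 min, 8.32 MB result file.
MODELS = 99
RUNTIME_MIN = 4 * 60 + 26
RESULT_MB = 8.32


def projected_size_mb(hours: float) -> float:
    """Linear projection (assumption: constant model rate and size per model)."""
    models = MODELS * (hours * 60) / RUNTIME_MIN  # models produced in `hours`
    return models * RESULT_MB / MODELS            # MB at the observed size per model


for h in (12, 24):
    print(f"{h} h: about {projected_size_mb(h):.1f} MB")
```

Under those assumptions the result file would be roughly 22.5 MB after 12 hours and roughly 45 MB after 24 hours, which is the kind of upload the 99-model cap avoids.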
ID: 59237 · Rating: 0
transient
Joined: 30 Sep 06
Posts: 376
Credit: 9,306,724
RAC: 3,432
Message 59238 - Posted: 2 Feb 2009, 6:22:36 UTC - in response to Message 59225.  

I've had a couple of these "Validate Errors" recently:

Mini 1.47 Task 223871308
and
Mini 1.54 Task 224694361

Both ended with exit code: 0 (0x0) and seemed to run successfully, but with no credit given.

Is it something I did, a bug or just one of those things?


I only checked the 1.54 task. You have a runtime preference of 3 hours. This one ran for 7 hours with no finished models. I'd say the watchdog, which aborts tasks running longer than intended, cut in. This is one for the long-running tasks thread.
ID: 59238 · Rating: 0
LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 59239 - Posted: 2 Feb 2009, 13:39:00 UTC - in response to Message 59238.  

I've had a couple of these "Validate Errors" recently:

Mini 1.47 Task 223871308
and
Mini 1.54 Task 224694361

Both ended with exit code: 0 (0x0) and seemed to run successfully, but with no credit given.

Is it something I did, a bug or just one of those things?

I only checked the 1.54 task. You have a runtime preference of 3 hours. This one ran for 7 hours with no finished models. I'd say the watchdog, which aborts tasks running longer than intended, cut in. This is one for the long-running tasks thread.

Thanks for copying here - I thought it was just a problem with the validator (the error message being the clue). You're right, there's no "Done" section after the first model starts until the boinc_finish, which is odd, but there's no mention of the watchdog cutting in, even though it does run a long time. But on the 1.47 WU there are 3 models done, so I'm not entirely convinced it's the same thing.

Usually long-running jobs get a default credit of 80, don't they? Looks like I missed out all ways. Oh well...
ID: 59239 · Rating: 0
falingtrea
Joined: 8 Aug 07
Posts: 1
Credit: 646,281
RAC: 758
Message 59240 - Posted: 2 Feb 2009, 16:25:44 UTC

Just got this error trying to perform an update:

2/2/2009 10:05:58 AM|rosetta@home|Sending scheduler request: Requested by user
2/2/2009 10:05:58 AM|rosetta@home|(not requesting new work or reporting completed tasks)
2/2/2009 10:06:03 AM|rosetta@home|Scheduler RPC succeeded
2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory
2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down

Server is up according to the webpage. One task was updated as complete.
ID: 59240 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 4871
Credit: 3,647,659
RAC: 669
Message 59245 - Posted: 2 Feb 2009, 22:02:36 UTC

There is something odd going on with the graphics of lr5_D_score12_rlbd_2hsh_IGNORE_THE_REST_DECOY_6246_424_0. The plot disappears completely at times, and so does the accepted energy; then they reappear. It all seems to depend on the energy value of the moment. As far as I know this is not normal.
ID: 59245 · Rating: 0
TeAm Enterprise
Joined: 28 Sep 05
Posts: 18
Credit: 24,862,633
RAC: 7,257
Message 59249 - Posted: 3 Feb 2009, 3:44:56 UTC

I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.
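
As a sanity check on those numbers, a plain linear extrapolation from percent done and elapsed time can be set against the remaining time BOINC reports. This is only a sketch: it assumes progress is linear in time, which may not hold for these tasks, and the function name is hypothetical.

```python
# (percent done, elapsed minutes, BOINC's reported minutes remaining), taken from the post above
cores = {
    "Core 1": (82.4, 2 * 60 + 28, 1 * 60 + 47),
    "Core 2": (34.1, 1 * 60 + 4, 72 * 60 + 3),
    "Core 3": (15.5, 0 * 60 + 34, 150 * 60 + 48),
}


def linear_remaining(percent_done: float, elapsed_min: float) -> float:
    """Minutes left if progress were linear in time (an assumption)."""
    total = elapsed_min / (percent_done / 100.0)
    return total - elapsed_min


for name, (pct, elapsed, reported) in cores.items():
    est = linear_remaining(pct, elapsed)
    print(f"{name}: linear estimate {est:.0f} min left, BOINC reports {reported} min")
```

On that basis every task would finish in about three hours or a little over, while the reported figures for cores 2 and 3 run to several thousand minutes. The mismatch is largest on the tasks that have only just started.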
ID: 59249 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 4871
Credit: 3,647,659
RAC: 669
Message 59253 - Posted: 3 Feb 2009, 10:11:54 UTC - in response to Message 59249.  

I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.



What version of BOINC Manager are you using?
It looked like you were using 5.10.45, which is quite old.
ID: 59253 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1449
Credit: 5,776,642
RAC: 0
Message 59254 - Posted: 3 Feb 2009, 12:41:38 UTC - in response to Message 59249.  

I keep running out of work on my Q6600. I think I found out why ... Boinc task manager shows that a WU will take 100+ hours of CPU time instead of the <3 hours it actually takes. Thus the manager asks for 400,000 seconds of work and gets one WU.

For some reason when a core finishes a WU and asks for another it gets nothing for a long time thus sitting idle.

Right now 3 cores are running:

Core 1 is 82.4% done at 2 hr 28 min with 1 hr 47 min left
Core 2 is 34.1% done at 1 hr 4 min with 72 hr 3 min left
Core 3 is 15.5% done at 0 hr 34 min with 150 hr 48 min left
Core 4 is idle.


Sounds like your Duration Correction Factor is way off. You can wait for BOINC to fix it, which it will do all by itself. Or you can shut down BOINC and manually edit the file client_state.xml. Open it in Notepad or whatever text editor you like, and go down until you find this project and the line that says:
<duration_correction_factor>0.705998</duration_correction_factor>
That is a copy of my line; yours will have a number like 70.?????????? or whatever. Change it to 1.000000, save the file, and then restart BOINC. BOINC will then use the new value to get new work and to recalculate how long it will take to crunch your existing units. If you right-click the file name, an EDIT option is listed; use that to edit the file. Do NOT change anything else in the file. The line you are looking for is near the top; mine was only 65 lines down from the top.
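
For anyone who would rather script the edit than do it by hand, here is a minimal sketch of the same change. It rests on assumptions not stated in the thread: that each project's settings sit inside a <project>...</project> block in client_state.xml, and that the Rosetta block can be recognised by a piece of its URL (the "bakerlab.org/rosetta" fragment in the usage note is a guess based on the links posted here). The function names are made up for illustration. As with the manual edit, run it only while BOINC is shut down.

```python
import re
import shutil
from pathlib import Path

# Assumption: each project's settings live in a <project>...</project> block.
PROJECT_BLOCK = re.compile(r"<project>.*?</project>", re.DOTALL)
DCF_TAG = re.compile(
    r"(<duration_correction_factor>)[^<]*(</duration_correction_factor>)"
)


def reset_dcf(xml_text: str, project_url_fragment: str, new_value: float = 1.0) -> str:
    """Set duration_correction_factor to new_value, but only inside project
    blocks that contain project_url_fragment. Everything else is left as-is."""

    def fix_block(match: re.Match) -> str:
        block = match.group(0)
        if project_url_fragment not in block:
            return block  # some other project: do not touch it
        return DCF_TAG.sub(
            lambda m: f"{m.group(1)}{new_value:.6f}{m.group(2)}", block
        )

    return PROJECT_BLOCK.sub(fix_block, xml_text)


def reset_dcf_in_file(path: Path, project_url_fragment: str) -> None:
    """Keep a .bak copy of the file, then rewrite it with the corrected value."""
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    path.write_text(reset_dcf(path.read_text(), project_url_fragment))
```

A call would look something like reset_dcf_in_file(Path("client_state.xml"), "bakerlab.org/rosetta"), with the path pointing at your own BOINC data directory. The backup copy is there so the original file can be restored if anything looks wrong afterwards.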
ID: 59254 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 59256 - Posted: 3 Feb 2009, 12:49:09 UTC

Compute error, though it looks more like a zip error ...


process exited with code 1 (0x1, -255)

Watchdog active.
Hbond tripped.

ERROR: dis==0 in pairtermderiv!
ERROR:: Exit from: src/core/scoring/methods/PairEnergy.cc line: 338
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Not sure what to make of this error ... happened on the Mac Pro ...
ID: 59256 · Rating: 0
TeAm Enterprise
Joined: 28 Sep 05
Posts: 18
Credit: 24,862,633
RAC: 7,257
Message 59258 - Posted: 3 Feb 2009, 15:06:00 UTC - in response to Message 59253.  

Using version 6.4.5 which I downloaded and installed about 6 days ago.


What version of BOINC Manager are you using?
It looked like you were using 5.10.45, which is quite old.
ID: 59258 · Rating: 0
TeAm Enterprise
Joined: 28 Sep 05
Posts: 18
Credit: 24,862,633
RAC: 7,257
Message 59259 - Posted: 3 Feb 2009, 15:11:39 UTC - in response to Message 59254.  

That fixed it! Thanks, my duration was set at 55+.



ID: 59259 · Rating: 0
cenit
Joined: 1 Apr 07
Posts: 13
Credit: 1,630,287
RAC: 0
Message 59261 - Posted: 3 Feb 2009, 17:12:07 UTC - in response to Message 59240.  

Just got this error trying to perform an update:

2/2/2009 10:05:58 AM|rosetta@home|Sending scheduler request: Requested by user
2/2/2009 10:05:58 AM|rosetta@home|(not requesting new work or reporting completed tasks)
2/2/2009 10:06:03 AM|rosetta@home|Scheduler RPC succeeded
2/2/2009 10:06:03 AM|rosetta@home|Message from server: Server error: can't attach shared memory
2/2/2009 10:06:03 AM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
2/2/2009 10:06:03 AM|rosetta@home|Reason: project is down

Server is up according to the webpage. One task was updated as complete.


You have to wait and it will correct itself.
Maybe it has been a long time since your last Rosetta WU. During that time the project changed its web address, so BOINC needs to re-fetch the master file. Leave it alone and within 24 hours at most it will redownload it and resume working!
ID: 59261 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1449
Credit: 5,776,642
RAC: 0
Message 59266 - Posted: 3 Feb 2009, 23:48:58 UTC - in response to Message 59259.  

That fixed it! Thanks, my duration was set at 55+.




Yeah, for some reason this has happened A LOT lately.
ID: 59266 · Rating: 0
transient
Joined: 30 Sep 06
Posts: 376
Credit: 9,306,724
RAC: 3,432
Message 59272 - Posted: 4 Feb 2009, 6:22:55 UTC

That could be related to the BOINC version (6.4.5 and higher). The complaints about the RDCF being completely off usually come from people who have installed it. A not uncommon opinion is that version 6.4.5 was made the recommended version too hastily, in order to get the CUDA capabilities out.
ID: 59272 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 4871
Credit: 3,647,659
RAC: 669
Message 59273 - Posted: 4 Feb 2009, 9:08:55 UTC
Last modified: 4 Feb 2009, 9:09:38 UTC

lr6_E_score12_rlbd_1e6i_IGNORE_THE_REST_DECOY_6254_236_1
ERROR:: Exit from: ..\..\src\protocols\checkpoint\CheckPointer.cc line: 87
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish
ID: 59273 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 4871
Credit: 3,647,659
RAC: 669
Message 59274 - Posted: 4 Feb 2009, 9:11:24 UTC

lr5_D_hybrid_rlbd_1bmg_IGNORE_THE_REST_DECOY_6250_424_0
Initializing options.... ok
ERROR: Option file open failed for: relax_options_lr5_D_hybrid_mtyka

ID: 59274 · Rating: 0
KC0ISW
Joined: 28 Sep 05
Posts: 2
Credit: 58,926
RAC: 0
Message 59278 - Posted: 4 Feb 2009, 12:43:32 UTC - in response to Message 59274.  

http://boinc.bakerlab.org/rosetta/result.php?resultid=226103545
ID: 59278 · Rating: 0