Minirosetta v1.47 bug thread.

Message boards : Number crunching : Minirosetta v1.47 bug thread.

stewjack
Joined: 23 Apr 06
Posts: 39
Credit: 95,871
RAC: 0
Message 58166 - Posted: 24 Dec 2008, 23:12:52 UTC - in response to Message 58163.  

Hi.

I have this task at the moment running, it's odd. This morning when i restarted

the ... task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147
pete.


I have had that happen three times during the last 4 or 5 days. I didn't report it because technically
such actions are not prohibited. The tasks complete and grant credit.
However, I have set my task length to 2 hours for now,
and these tasks run well over that time.

NOTE: I have checkpoint logging turned on!

ALL TIMES APPROX.

4 hours with no checkpoints after 40 min
cc_nonideal_3_5_nocst4_hb_t374__IGNORE_THE_REST_2FCKA_10_5832_14_0

3.5 hours with no checkpoints after 35 min
cc2_1_8_mammoth_mix_cen_cst_hb_t332__IGNORE_THE_REST_1V2XA_7_5888_15_0

3 hours with no checkpoints after 50 min
cc_nonideal_0_6_nocst4_hb_t313__IGNORE_THE_REST_1GOJA_10_5910_16_0

NOTE: On the last WU I noticed that when I restarted the task,
well into the no checkpointing period -
checkpointing restarted for a short period of time!




ID: 58166 · Rating: 0
Stacey Baird
Joined: 11 Apr 06
Posts: 19
Credit: 74,745
RAC: 0
Message 58167 - Posted: 25 Dec 2008, 0:42:17 UTC - in response to Message 57902.  

HoHo kids!

We've got a new minirosetta version, with - you've guessed it - more bug fixes ! Woo!

Please report remaining issues here - that would be grand :)


Hello, I don't know if this is a bug AND I am not one to complain about receiving credit, however, I was very surprised to receive so much credit compared to claimed credit. Is the result below likely?

216467986
Name cc_nonideal_2_2_nocst4_hb_t297__IGNORE_THE_REST_1YZFA_4_6046_19_0
Workunit 197278592
Created 23 Dec 2008 6:24:21 UTC
Sent 23 Dec 2008 7:45:54 UTC
Received 24 Dec 2008 15:54:32 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 947263
Report deadline 2 Jan 2009 7:45:54 UTC
CPU time 5719.655
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time
failed to create shared mem segment
CreateSemaphore failure! Cannot create semaphore!

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time

CreateFile error 32 when trying set file time
======================================================
DONE :: 1 starting structures 5719.56 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

Validate state Valid
Claimed credit 14.4476221738839
Granted credit 41.0260851670465
application version 1.47
ID: 58167 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 58168 - Posted: 25 Dec 2008, 5:21:58 UTC - in response to Message 58163.  

Hi.

I have this task running at the moment, and it's odd. This morning when I restarted the system, BOINC was showing 5hrs, 4mins completed; when the task got its turn to run it dropped back to 1hr, 33mins and showing 2 models. It would have done more than two in the five hours!

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=197257513

Thu 25 Dec 2008 08:42:56 EST|rosetta@home|Restarting task cc_nonideal_1_8_nocst4_hb_t303__IGNORE_THE_REST_1FEZA_6_6019_17_0 using minirosetta version 147

pete.



Well still looks odd to me, ended up taking 7hrs, 11min plus the 3 and a half

hours lost on restarting. I have a six hour R/T set and it still only did 4 models.

See below.

# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 25890.1 cpu seconds
This process generated 4 decoys from 4 attempts
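
For anyone wanting to sanity-check numbers like these, here is a minimal Python sketch that pulls the CPU seconds and decoy count out of the two stderr summary lines and reports the average time per decoy. It is not part of minirosetta or BOINC; the function name and the regular expressions are assumptions, and it only works if the summary lines look exactly like the ones quoted above.

```python
import re

def seconds_per_decoy(stderr_text):
    """Return (cpu_seconds, decoys, average seconds per decoy).

    Hypothetical helper: assumes the 'DONE ::' and 'This process generated'
    lines have the same layout as the ones quoted in this thread.
    """
    cpu = re.search(r"DONE :: \d+ starting structures ([\d.]+) cpu seconds", stderr_text)
    dec = re.search(r"This process generated (\d+) decoys from \d+ attempts", stderr_text)
    if cpu is None or dec is None:
        raise ValueError("summary lines not found")
    cpu_seconds = float(cpu.group(1))
    decoys = int(dec.group(1))
    if decoys == 0:
        raise ValueError("no decoys reported")
    return cpu_seconds, decoys, cpu_seconds / decoys

sample = """DONE :: 1 starting structures 25890.1 cpu seconds
This process generated 4 decoys from 4 attempts"""

cpu_seconds, decoys, avg = seconds_per_decoy(sample)
print(f"{decoys} decoys, {avg / 3600:.2f} hours per decoy on average")
```

For the figures above this works out to roughly 1.8 hours of CPU time per decoy.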



ID: 58168 · Rating: 0
DaveSun
Joined: 3 May 07
Posts: 5
Credit: 200,480
RAC: 0
Message 58170 - Posted: 25 Dec 2008, 15:49:57 UTC - in response to Message 58157.  

I found this WU stalled after 15 hrs. I suspended the task and then re-enabled it later. After it started again it stalled at the same point. I looked at the box and it had a popup reporting a C++ runtime error, saying the application had asked to be shut down in an unusual way.

STDERR OUT

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# cpu_run_time_pref: 86400

</stderr_txt>
]]>



Had this WU this morning with the same error. It ran for 7 hours before stalling. Both are vanilla type. I still have one more of these in progress, it is currently at 21 hours and so far looks good.
ID: 58170 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 58172 - Posted: 25 Dec 2008, 23:37:52 UTC

Hi.

Here's another one doing strange things. When I shut down last night it had run for 6hrs, 30min and had done 18 models; when it restarted it went back to 5hrs, 26min, still showing 18 models, and it then ran to 6hrs, 18min with still only 18 models!
Still odd; I haven't seen this before, and it's the same type of task.

Fri 26 Dec 2008 09:03:52 EST|rosetta@home|Restarting task cc_nonideal_1_3_nocst4_hb_t306__IGNORE_THE_REST_1AZVA_6_5992_27_0 using minirosetta version 147

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=197386767

# cpu_run_time_pref: 21600
======================================================
DONE :: 1 starting structures 22718.4 cpu seconds
This process generated 18 decoys from 18 attempts
======================================================

pete.


ID: 58172 · Rating: 0
Stacey Baird
Joined: 11 Apr 06
Posts: 19
Credit: 74,745
RAC: 0
Message 58173 - Posted: 26 Dec 2008, 4:35:26 UTC

I am having much the same problems with stops, starts, incomprehensible progress (if any progress) reports, strange error reports, stalling, misrepresentation of time budgeting in the Tasks function and other weirdness.

Minirosetta v1.47 wastes too much time and steals processing time from other processing jobs that actually work.

I suspect that part of the problem is programmers and others being on Christmas break and not being available for problem solving.

As a result I have suspended Rosetta processing until at least January 3rd pending cleanup of the issues.
ID: 58173 · Rating: 0
Mike Tyka
Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 58175 - Posted: 26 Dec 2008, 13:19:46 UTC - in response to Message 58166.  


NOTE: I have checkpoint logging turned on!

ALL TIMES APPROX.

4 hours with no checkpoints after 40 min
cc_nonideal_3_5_nocst4_hb_t374__IGNORE_THE_REST_2FCKA_10_5832_14_0

3.5 hours with no checkpoints after 35 min
cc2_1_8_mammoth_mix_cen_cst_hb_t332__IGNORE_THE_REST_1V2XA_7_5888_15_0

3 hours with no checkpoints after 50 min
cc_nonideal_0_6_nocst4_hb_t313__IGNORE_THE_REST_1GOJA_10_5910_16_0

NOTE: On the last WU I noticed that when I restarted the task,
well into the no checkpointing period -
checkpointing restarted for a short period of time!


This is pointing to a problem with checkpointing in the FoldCst protocol. I'll put this high on the todo list for the 1.48 release.
The runtimes look very reasonable though! I'm afraid making a single decoy shorter than 3-4 hours is not always possible - what kind of machine was this on ?

Mike


http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
ID: 58175 · Rating: 0
stewjack
Joined: 23 Apr 06
Posts: 39
Credit: 95,871
RAC: 0
Message 58177 - Posted: 26 Dec 2008, 14:55:13 UTC - in response to Message 58175.  


The runtimes look very reasonable though! I'm afraid making a single decoy shorter than 3-4 hours is not always possible


That would make sense. Normally my WU run time is set to 4 hours.


- what kind of machine was this on ?


Compaq Presario 6029
AMD Athlon XP 2100 (1.7 GHz)
Windows XP Home ( BOINC v 6.2.19 )
RAM: 768 MB
VIDEO CARD: Radeon 9250 128MB
Dial-up: USRobotics Controller Modem
ID: 58177 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 58180 - Posted: 27 Dec 2008, 9:52:34 UTC

serious credit issue here:
cc2_1_8_mammoth_fa_cst_hb_t303__IGNORE_THE_REST_2AH5A_4_6138_17_0
Claimed credit 106.166115188458
Granted credit 74.8691857584611

That is worse than the other mammoth task i had which had something like a 10 point difference. It also ran over my preferences of time. See long running tasks thread.
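
To put a claimed-versus-granted gap like this on a common scale, here is a quick Python sketch that expresses granted credit as a fraction of claimed credit, using the two figures quoted in this post. The helper name is made up for illustration and says nothing about how the project actually computes granted credit.

```python
def granted_ratio(claimed, granted):
    """Granted credit as a fraction of claimed credit (hypothetical helper)."""
    if claimed <= 0:
        raise ValueError("claimed credit must be positive")
    return granted / claimed

# Figures copied from the result quoted in this post.
claimed = 106.166115188458
granted = 74.8691857584611

ratio = granted_ratio(claimed, granted)
print(f"granted/claimed = {ratio:.2f}, shortfall = {claimed - granted:.1f} credits")
```

A ratio below 1 means less was granted than claimed; a ratio above 1 means more was granted than claimed.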
ID: 58180 · Rating: 0
Ian_D
Joined: 21 Sep 05
Posts: 55
Credit: 4,216,173
RAC: 0
Message 58183 - Posted: 27 Dec 2008, 11:26:21 UTC - in response to Message 58099.  

After earlier clean runs of memtest86+ 2.10 and prime95 for Linux, I can no longer get decent results out of prime95, even though memtest86+ 2.10 still runs fine.

As you'd most likely expect I'm putting the errors below down to hardware !!

Don't know if it's the CPU or more likely the mainboard northbridge. Have a newer CPU on order to rule that out.

Have removed said machine from my "farm".

Cheers and Happy Christmas and a computational bug free New Year


CPU type GenuineIntel
Intel(R) Pentium(R) 4 CPU 2.60GHz [Family 15 Model 2 Stepping 9]
Number of CPUs 2
Operating System Linux
2.6.24-22-generic

process exited with code 193 (0xc1, -63)
Stack trace (22 frames):
[0x8b979b7]
[0x8bc20b0]
[0xb7f03420]
[0x83c53bc]
[0x84356a0]
[0x83c4fa3]
[0x83ba6f8]
[0x85c2f4e]
[0x80cf524]
[0x80de98f]
[0x83376f7]
[0x8337100]
[0x8243364]
[0x82a246c]
[0x818e15a]
[0x819bae3]
[0x819b3aa]
[0x8127771]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

https://boinc.bakerlab.org/rosetta/result.php?resultid=215801702

process exited with code 193 (0xc1, -63)
SIGSEGV: segmentation violation
Stack trace (20 frames):
[0x8b979b7]
[0x8bc20b0]
[0xb7fa5420]
[0x83c4fa3]
[0x83ba6f8]
[0x85c2f4e]
[0x80cf1ff]
[0x80de98f]
[0x83376f7]
[0x8337100]
[0x8243364]
[0x82a246c]
[0x818e15a]
[0x819bae3]
[0x819b3aa]
[0x8127771]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

https://boinc.bakerlab.org/rosetta/result.php?resultid=215414530

process exited with code 193 (0xc1, -63)
SIGSEGV: segmentation violation
Stack trace (23 frames):
[0x8b979b7]
[0x8bc20b0]
[0xb7f48420]
[0x8ace23a]
[0x84348d3]
[0x8ace5f6]
[0x8acd739]
[0x83b1c55]
[0x862a631]
[0x83f65af]
[0x80cece6]
[0x80de98f]
[0x82c37e4]
[0x82b897a]
[0x82c16c1]
[0x818d6ee]
[0x819bae3]
[0x819b3aa]
[0x8127771]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

https://boinc.bakerlab.org/rosetta/result.php?resultid=215035006

What's going on with the Rosetta Linux App ? Sometimes it works , sometimes it's duff ? Machine NOT overclocked in the slightest

Cheers




ID: 58183 · Rating: 0
Rifleman
Joined: 19 Nov 08
Posts: 17
Credit: 139,408
RAC: 0
Message 58188 - Posted: 27 Dec 2008, 22:52:28 UTC

I am having problems with another WU that ran fine up to 99 percent and gets stalled. I let one run for 37 hours until watchdog terminated it. I have preferences set for 12 hours so that is fine. The granted credit is what bothered me. : https://boinc.bakerlab.org/rosetta/result.php?resultid=216862173
I am debating cancelling the WU that is presently doing the same thing as wasting all that CPU time for 2 decoys seems like--well---a waste!!
https://boinc.bakerlab.org/rosetta/result.php?resultid=217161601
ID: 58188 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 58192 - Posted: 28 Dec 2008, 0:21:56 UTC - in response to Message 58188.  

I am having problems with another WU that ran fine up to 99 percent and gets stalled. I let one run for 37 hours until watchdog terminated it. I have preferences set for 12 hours so that is fine. The granted credit is what bothered me. : https://boinc.bakerlab.org/rosetta/result.php?resultid=216862173
I am debating cancelling the WU that is presently doing the same thing as wasting all that CPU time for 2 decoys seems like--well---a waste!!
https://boinc.bakerlab.org/rosetta/result.php?resultid=217161601


Where did it seem to get stalled at - about 10 minutes left to go? If so, that's what typically happens when a minirosetta workunit goes out with a serious underestimate of the time required to run it. When I had one like that, a few versions ago, I let it finish (in about 4 times the time I set as preference) and at least got some credit for it, but not much more than typical for workunits that actually finished in the estimated time. At about 10 minutes left to go, the estimated time calculations get messed up, but not the calculations leading to the desired results.

ID: 58192 · Rating: 0
Rifleman
Joined: 19 Nov 08
Posts: 17
Credit: 139,408
RAC: 0
Message 58193 - Posted: 28 Dec 2008, 0:58:07 UTC

Hi Robert. Yeah----it stopped at about 10 minutes to go-----and stayed that way for 25 hours---lol. Watchdog terminated it.
I aborted another after 18 hours in. It was the same type protein as the first one. I have 2 more being crunched at the moment and am watching to see how they do after 12 hours in.
Task ID 216862173
Name 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_17673_0
Workunit 197639536
Created 25 Dec 2008 6:09:31 UTC
Sent 25 Dec 2008 7:37:31 UTC
Received 27 Dec 2008 5:01:41 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 948562
Report deadline 4 Jan 2009 7:37:31 UTC
CPU time 134234.2
stderr out <core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 43200
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 134233 seconds. Greater than 3X preferred time: 43200 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>


Validate state Valid
Claimed credit 561.58588373264
Granted credit 117.029798631356
application version 1.47
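
The watchdog message above states its own rule: the run is ended once CPU time exceeds 3X the preferred run time. Here is a minimal Python sketch of that check, assuming nothing beyond what the stderr text says. The function name and the `factor` parameter are illustrative, not the actual minirosetta code.

```python
def watchdog_would_end(cpu_seconds, preferred_seconds, factor=3.0):
    """True if CPU time exceeds factor x the preferred run time.

    Illustrative only: mirrors the 'Greater than 3X preferred time'
    wording in the stderr output, not the real watchdog implementation.
    """
    return cpu_seconds > factor * preferred_seconds

# Figures from the result above: 43200 s preference, 134233 s CPU time.
limit = 3 * 43200
print(limit)                               # 129600
print(watchdog_would_end(134233, 43200))   # True
print(134233 - limit)                      # seconds past the limit
```

So a 12-hour preference gives a 36-hour ceiling under this reading, which fits the roughly 37 hours reported before the watchdog stepped in.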
ID: 58193 · Rating: 0
Ian_D
Joined: 21 Sep 05
Posts: 55
Credit: 4,216,173
RAC: 0
Message 58194 - Posted: 28 Dec 2008, 8:28:56 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=217325144

Nearly 16 hrs in when I spotted it and now it reports, after a manual abort, it has done 0 CPU time ?!?!


ID: 58194 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 58195 - Posted: 28 Dec 2008, 9:33:57 UTC

guys,
don't forget to also post this info in the "Report long-running models here" thread.
ID: 58195 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 58197 - Posted: 28 Dec 2008, 13:57:49 UTC

Somewhere below the question was raised if the "Lock file" error has been fixed. It has not. If you look at this Computer you can see that I have several.

It is not at all clear why this happened.

As you can see it is a 4 Core processor with HT giving 8 virtual processors and I know that at one point I had at least 4 tasks running at the same time. Could this be a concurrency problem? At any rate this is a new machine in the prime of its existence in that it is just over a week old. It is run 24/7 and I have been running about 6-8 projects on the machine and I am not seeing errors like this on other projects. Heck, even GPU Grid is running reasonably well ...

The log files do not record the start time of the processing so you cannot tell for sure if that is the problem here. I still have a few tasks to go and I will run them to completion and see if I get more of these errors in the remaining tasks I have.

I note that my Mac Pro, also with 8 processors has not had this error, but, the project loading on that computer is such that I can't recall an instance where I had more than one Rosetta task running at the same time.

Looking at my other computers, all are multi-processor with at least 4 CPUs and I cannot see this error on any of those machines. I have two tasks running on the i7 right now so I will see if they will die with a collision. The tasks are cc2_1_8_native_cen_cst_hb_t373 and cc2_1_8_native_fa_cst_hb_t373 ...

I have been ignoring Rosetta so I cannot say that I know what the alphabet soup that makes up the task id means (if anything) so I can't tell if there is something common in the actual tasks or not ...

I just find it disappointing that this error surfaced so late in processing. One would think that the error would surface immediately.
ID: 58197 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 58202 - Posted: 28 Dec 2008, 17:17:19 UTC

Since my last post I have completed two tasks successfully on this machine. I have two more in the queue and they are running now. So, by the time you read this they should probably have run to completion or failure. Watching my 8 CPU systems for some time now I have noted that, in general, I never seem to have more than 2 Rosetta tasks running at the same time due to other projects.

On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time?

Well, I put the i7 into NNT until I can get an answer on this. I hate to waste my time crashing tasks ... or trying to baby sit the machines ...
ID: 58202 · Rating: 0
Ian_D
Joined: 21 Sep 05
Posts: 55
Credit: 4,216,173
RAC: 0
Message 58205 - Posted: 28 Dec 2008, 19:46:19 UTC
Last modified: 28 Dec 2008, 19:55:30 UTC

* sigh *

https://boinc.bakerlab.org/rosetta/result.php?resultid=217461782

<core_client_version>6.2.15</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200
terminate called after throwing an instance of 'std::bad_alloc'
what(): St9bad_alloc
SIGABRT: abort called
Stack trace (27 frames):
[0x8b979b7]
[0x8bc20b0]
[0xb7f22420]
[0x8c24ca4]
[0x8c12c5b]
[0x8c10261]
[0x8c10296]
[0x8c0fe43]
[0x8c0f86c]
[0x8a88ba5]
[0x8559c48]
[0x83e8bc3]
[0x87f80df]
[0x87dc3c7]
[0x80de412]
[0x80d0686]
[0x80d0b2e]
[0x80c88b9]
[0x80de971]
[0x80d7d76]
[0x8064271]
[0x8117277]
[0x8127c00]
[0x8129a1a]
[0x804b9c8]
[0x8c1dbac]
[0x8048111]

Exiting...

</stderr_txt>
]]>

and

https://boinc.bakerlab.org/rosetta/result.php?resultid=217459230

<core_client_version>6.2.15</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200
**********************************************************************
Rosetta is going too long. Watchdog is ending the run!
CPU time: 26682.9 seconds. Greater than 3X preferred time: 7200 seconds
**********************************************************************
called boinc_finish

</stderr_txt>
]]>


ID: 58205 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 58215 - Posted: 29 Dec 2008, 2:11:33 UTC - in response to Message 58202.  
Last modified: 29 Dec 2008, 2:38:47 UTC

On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time?

Well, I put the i7 into NNT until I can get an answer on this. I hate to waste my time crashing tasks ... or trying to baby sit the machines ...


I've noticed that for the more memory demanding BOINC projects, there often is a limit on how many incarnations will run at the same time, especially if you enable the Leave In Memory option but make no effort to increase the amount of swap space they can use. Before I increased the upper limit on swap space on my machine, only one minirosetta workunit would run at a time on my dual CPU core machine; now I often see a minirosetta workunit running on each of the CPU cores.

Adding more physical memory also helps, but I had previously increased it to the limit of what my machine can handle (2 GB).
ID: 58215 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 58216 - Posted: 29 Dec 2008, 2:41:56 UTC - in response to Message 58215.  
Last modified: 29 Dec 2008, 3:05:54 UTC

On the i7 about the time of the failures I know there was a period where there were 4 or more Rosetta tasks running at the same time. Perhaps there is an upper limit on the number of simultaneous incarnations that can be run at the same time?

Well, I put the i7 into NNT until I can get an answer on this. I hate to waste my time crashing tasks ... or trying to baby sit the machines ...


I've noticed that for the more memory demanding BOINC projects, there often is a limit on how many incarnations will run at the same time, especially if you enable the Leave In Memory option but make no effort to increase the amount of swap space they can use. Before I increased the upper limit on swap space on my machine, only one minirosetta workunit would run at a time on my dual CPU core machine; now I often see a minirosetta workunit running on each of the CPU cores.


According to my Task manager my peak was 3.9 G with limit 5G so, I did not even get close. I have 3G normal RAM (well, 6 actually, but XP can only "see" 3 G) so ...

Well, I will try to increase the swap file, but, have suspended work on this machine till the project says something... over half the tasks failed with this one error and I am still waiting to see what happens to the last task ... it has been running with 11 min to go for a couple hours now ... if the % Complete was not slowly rising I would have killed it by now ... the main reason I am letting it run is that curiosity overwhelms me as to if it is going to fail with the same error after eating up 10 or more hours of my time or not ...

Oh, man, this is worse... I had nearly 10 hours on the clock. Changed the memory settings to increase the possible size of the swap file (even though it had 2G never used) and after a reboot, the task ended with 8 hours clock time. It looks like it is valid ... but that tells me that I just wasted nearly 2 hours on a task that should have ended ...

{edit add} The tasks that ended badly *MAY* have all been suspended. I cannot say for sure that they were or not. They *MAY* have been. My setting for switching between tasks is 720 min (12 hours) to try to force most applications to finish before switching ... it is my way of trying to provide best results ... and with 4 plus cores it mostly works. But, I did notice that several of the Rosetta tasks did get suspended but I did not note which ones ... so more data to ponder if someone is actually going to look at this problem.{/edit} corrected time
ID: 58216 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org