Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Trotador
Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 0
Message 76058 - Posted: 22 Sep 2013, 7:09:58 UTC - in response to Message 76054.  

endo_ae__ results cause (and suffer from) BOINC heartbeat problems, and they do not checkpoint properly on one of my boxes. My guess is that they have very high RAM requirements (that box is my internet PC with only 2 GB RAM, with Firefox nearly always running, plus one Rosetta task and 3 projects with very low RAM requirements). They should probably be limited to boxes with more than 3 GB of physical RAM.

Unfortunately I could not catch/spy on one just before it crashed, so the RAM thing is only a guess. After the crash the RAM history is lost with the PID so I cannot check the maximum usage. Other result types seem not to be affected.


Indeed. The endo_ae tasks are terrible:
1. The first checkpoint takes a number of hours.
2. I have at least three which have crashed after a few minutes.
3. The credit from them is poor. In one example over 8 hours for only 20 points.


Yes, I'm having many errors in these tasks. They are reported as "Validate error", and all the copies sent out end with the same error on all the computers.

I do not think it is caused by lack of RAM, because my PCs have plenty of it (18 or 32 GB).

The log always looks like this (extract) once it starts to report errors:

================
....
ERROR: can't open file: minirosetta_database//sampling/filtered.vall.dat.2006-05-05
ERROR:: Exit from: src/core/fragment/picking_old/vall/vall_io.cc line: 63
# cpu_run_time_pref: 7200

ERROR: aFrame->nr_frags()
ERROR:: Exit from: src/core/fragment/FragSet.cc line: 197

ERROR: aFrame->nr_frags()
ERROR:: Exit from: src/core/fragment/FragSet.cc line: 197

"repeated 98 times"

ERROR: aFrame->nr_frags()
ERROR:: Exit from: src/core/fragment/FragSet.cc line: 197
======================================================
DONE :: 99 starting structures 1201 cpu seconds
This process generated 99 decoys from 99 attempts
======================================================
BOINC :: WS_max 0

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish
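
For anyone comparing several logs like the one above, here is a minimal sketch that tallies each distinct error message together with its exit location. It is not part of Rosetta or BOINC; the function name `summarize_errors` is made up for illustration, and it assumes the two-line layout seen in this extract ("ERROR: ..." followed by "ERROR:: Exit from: ...").

```python
from collections import Counter


def summarize_errors(log_text):
    """Count each distinct 'ERROR: ...' message together with its exit location."""
    counts = Counter()
    lines = [line.strip() for line in log_text.splitlines()]
    for i, line in enumerate(lines):
        # Message lines start with "ERROR: "; location lines start with "ERROR:: ".
        if line.startswith("ERROR: "):
            message = line[len("ERROR: "):]
            location = ""
            prefix = "ERROR:: Exit from: "
            if i + 1 < len(lines) and lines[i + 1].startswith(prefix):
                location = lines[i + 1][len(prefix):]
            counts[(message, location)] += 1
    return counts


# Shortened copy of the extract quoted in the post.
sample = """\
ERROR: can't open file: minirosetta_database//sampling/filtered.vall.dat.2006-05-05
ERROR:: Exit from: src/core/fragment/picking_old/vall/vall_io.cc line: 63
# cpu_run_time_pref: 7200

ERROR: aFrame->nr_frags()
ERROR:: Exit from: src/core/fragment/FragSet.cc line: 197

ERROR: aFrame->nr_frags()
ERROR:: Exit from: src/core/fragment/FragSet.cc line: 197
"""

summary = summarize_errors(sample)
for (message, location), n in summary.most_common():
    print(f"{n:3d} x {message}  [{location}]")
```

A tally like this makes it easier to see which message dominates and which one appears only once near the start of the run.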

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 76298 - Posted: 19 Dec 2013, 0:51:36 UTC - in response to Message 76058.  
Last modified: 19 Dec 2013, 0:53:46 UTC

I managed to muddle through this series of tasks with 4 GB of physical memory and 2 GB of virtual memory. I got only 1 decoy done in 4 hours on one task and got minimal credit for it. You almost need a desktop supercomputer to handle those tasks.

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 76327 - Posted: 7 Jan 2014, 2:06:37 UTC

The task named d6587_ab_4Jan2014_117685_485_0 (Task id 628550437) gave an early computation error.

The lengthy error log consisted of the following, repeated multiple times:

BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
command: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.48_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro -frag3 d6587.200.3mers -in:file:native 52nc2_dis20-68.des.pdb -silent_gz 1 -frag9 d6587.200.9mers -out:file:silent default.out ex1 -abinitio::rsd_wt_loop 0.5 -relax::default_repeats 15 -abinitio::use_filters false -abinitio::increase_cycles 10 -abinitio::rsd_wt_helix 0.5 -abinitio::rg_reweight 0.5 -in:file:boinc_wu_zip d6587.ab.4Jan2014.zip -out:file:silent default.out -silent_gz -mute all -nstruct 10000 -cpu_run_time 10800 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2952613
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
ERROR: ERROR: Unused "free" argument specified: ex1
[
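
The final error in the log above points at the bare token ex1 in the command line. Every other option there starts with a dash (compare -ex2aro), so ex1 was presumably meant to be -ex1. Below is a rough sketch of how such a stray token can be spotted. It is a heuristic only and not Rosetta's real option parser: it assumes every option starts with "-" and takes at most one value, and the function name `find_free_arguments` is hypothetical.

```python
def find_free_arguments(command):
    """Return tokens that look like stray ("free") arguments.

    Heuristic (an assumption, not Rosetta's parser): every option starts
    with "-" and takes at most one value, so a token without a leading "-"
    that does not directly follow an option belongs to no option.
    Negative numeric values would be misread as options.
    """
    tokens = command.split()[1:]  # skip the executable name
    free = []
    can_take_value = False
    for token in tokens:
        if token.startswith("-"):
            can_take_value = True   # the next bare token may be this option's value
        elif can_take_value:
            can_take_value = False  # bare token consumed as the option's value
        else:
            free.append(token)      # bare token with no option to attach to
    return free


# Shortened excerpt of the command line from the post.
excerpt = (
    "minirosetta_3.48_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro "
    "-frag3 d6587.200.3mers -out:file:silent default.out ex1 "
    "-abinitio::rsd_wt_loop 0.5 -silent_gz -mute all -nstruct 10000"
)
print(find_free_arguments(excerpt))
```

Since the command line is generated on the server side, a stray token like this would be the same for every host that receives the task.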

AMDave
Joined: 16 Dec 05
Posts: 35
Credit: 12,576,896
RAC: 0
Message 76336 - Posted: 9 Jan 2014, 18:57:49 UTC

I installed BOINC v7.2.33 and it ran smoothly for 10+ days. Then, starting around Jan 06, over 90% of tasks have produced this message:

"Task ... exited with zero status but no 'finished' file", followed by
"If this happens repeatedly you may need to reset the project."

I reset the project, then changed some preferences (I had planned these changes anyway):
Target CPU run time: from 4 hrs to 6 hrs
Disk and memory usage > Use at most: from less than 1 GB to 2 GB
Disk and memory usage > Write to disk at most every: from 70 sec to 90 sec


The issue remains. Interestingly, the Event Log shows the error message occurring every 3 hours (almost to the minute).
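
To check the "every 3 hours" observation against a saved copy of the Event Log, a small sketch like the following could compute the gaps between occurrences of the message. The line layout (date and time as the first two fields, in `DD-Mon-YYYY HH:MM:SS` form) is an assumption about the log format, the sample lines are made up for illustration, and the English month abbreviations rely on the default C/English locale.

```python
from datetime import datetime

MARKER = "exited with zero status but no 'finished' file"


def restart_gaps(log_lines, timestamp_format="%d-%b-%Y %H:%M:%S"):
    """Return the time gaps between consecutive lines containing MARKER."""
    stamps = []
    for line in log_lines:
        if MARKER in line:
            # Assumed layout: the first two space-separated fields are date and time.
            date_part, time_part = line.split()[:2]
            stamps.append(datetime.strptime(f"{date_part} {time_part}", timestamp_format))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# Made-up sample lines for illustration only; not taken from a real log.
sample_log = [
    "11-Jan-2014 09:00:05 [rosetta@home] Task example_task_0 exited with zero status but no 'finished' file",
    "11-Jan-2014 09:00:05 [rosetta@home] If this happens repeatedly you may need to reset the project.",
    "11-Jan-2014 12:00:07 [rosetta@home] Task example_task_0 exited with zero status but no 'finished' file",
    "11-Jan-2014 15:00:04 [rosetta@home] Task example_task_0 exited with zero status but no 'finished' file",
]

for gap in restart_gaps(sample_log):
    print(gap)
```

If the gaps cluster tightly around one value, comparing that value with the schedules of other software on the machine is a reasonable next step.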

I haven't noticed any similarities in the WU names. Is there a file that Rosetta is looking for and not finding, a flawed batch of WUs, or other cause?
Suggestions please.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1996
Credit: 9,653,827
RAC: 7,305
Message 76337 - Posted: 9 Jan 2014, 22:35:03 UTC

Please, restart Ralph server....

AMDave
Joined: 16 Dec 05
Posts: 35
Credit: 12,576,896
RAC: 0
Message 76338 - Posted: 11 Jan 2014, 18:03:18 UTC - in response to Message 76337.  

Please, restart Ralph server....


Issue remains. This particular WU has restarted 3 TIMES, at 3 hour intervals.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=572101126

If the above suggestion actually means resetting the project in the BOINC Mgr, that's not a solution, because:

1) It's been done and it didn't resolve anything, and
2) It addresses a symptom, not the cause

Is the cause bad WUs, or Rosetta 3.48? Also, what is the solution? Thanks.

Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,888,320
RAC: 0
Message 76343 - Posted: 12 Jan 2014, 16:23:52 UTC - in response to Message 76338.  

Hi Dave, boboviz's post wasn't in response to yours; he was trying to alert an admin that ralph@home was down.

The "exited with zero status but no 'finished' file" message occurs when some other task on your computer prevents the science app from communicating with BOINC. It is usually safe to ignore, as it will have to happen 100 times to a task before the task gives up and errors out. Since it's happening to you at such regular intervals, I suspect you recently set some scan to run regularly in the background. On the BOINC forum, Jord (Ageless) makes the following suggestions:
Possible causes of the "Task exited with zero status but no 'finished' file" syndrome:

1. Make sure you exclude the BOINC directory and all subdirectories (or the BOINC Data directory and all subdirectories in BOINC 6 and 7) from being actively scanned by anti-virus and anti-spyware software. Only scan when you have exited BOINC.

2. Don't defrag your disk with BOINC on.

3. Don't run Scandisk with BOINC on.

4. Disable Drive Indexing.

5. Update your motherboard chipset drivers, specifically those for your IDE or SATA controllers.

6. Disable the Time synchronization in Windows XP/Vista. Normally found under the clock (double click it in the system tray), third tab (Internet in English), uncheck the sync option.

7. When you use BOINC's CPU throttling function, you can run into the too many exit(0)s error. The advice here is to disable the BOINC throttling (set it to 100%) and reduce the number of CPUs/cores for BOINC to use.
** Use at most 100.0 percent of CPU time.
* In BOINC 7.0, this is done through the option On multiprocessors, use at most xxx% of the processors.


This is obviously not a Rosetta specific issue; it shows up on just about every project board at some time or another. Gary Roberts, the patient prince of einstein@home, explains what's happening in this post and the BOINC FAQ Service entry is here.

Hope this helps.

Snags

AMDave
Joined: 16 Dec 05
Posts: 35
Credit: 12,576,896
RAC: 0
Message 76348 - Posted: 14 Jan 2014, 19:46:41 UTC - in response to Message 76343.  

Huge thanks Snags. Re: the list

1) already done
2) already done
3) already done
4) check
5) already done
6) check
7) already done

Problem persists. A new WU restarted 3 times, again at 3 hour intervals. It's odd that everything was fine for about 10 days after installing the latest BOINC Mgr, and then this issue cropped up.

When this happens and the WU is restarted, does the computation begin anew for the WU, or does it pick up near the point where it exited?

dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,688,048
RAC: 10,544
Message 76349 - Posted: 14 Jan 2014, 21:18:12 UTC - in response to Message 76348.  
Last modified: 14 Jan 2014, 21:18:54 UTC

When this happens and the WU is restarted, does the computation begin anew for the WU, or does it pick up near the point where it exited?

Some types of task are able to checkpoint at certain intervals, and all tasks will checkpoint when a model is completed. A task (as you see them in BOINC Manager) can contain multiple decoys/models, so at the end of each of these there will effectively be a checkpoint.

So, if your computer reaches either a checkpoint, or completes a model and moves on to another one (still within the same task), then it will pick up from that point when Rosetta restarts. If it doesn't reach one of those points before being stopped then it will restart the task from the beginning.
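
The restart behaviour described above can be sketched as a small decision function. This is only an illustration of the rules as stated in this post, not Rosetta's actual code; the function and parameter names are hypothetical.

```python
def resume_point(models_completed, last_checkpoint_step=None):
    """Describe where a restarted task picks up, following the rules above.

    models_completed     -- decoys/models finished before the interruption
    last_checkpoint_step -- last checkpointed step inside the current model,
                            or None if no mid-model checkpoint was written
    """
    if last_checkpoint_step is not None:
        # A mid-model checkpoint exists, so work resumes inside the current model.
        return f"model {models_completed + 1}, step {last_checkpoint_step}"
    if models_completed > 0:
        # Finishing a model acts as a checkpoint, so the next model starts fresh.
        return f"start of model {models_completed + 1}"
    # Neither kind of checkpoint was reached: the whole task starts over.
    return "start of task"


print(resume_point(0))       # stopped before any checkpoint
print(resume_point(2))       # two models done, no mid-model checkpoint
print(resume_point(2, 40))   # two models done, checkpoint inside the third
```

In the worst case (first branch never reached and no model completed), all CPU time since the task started is lost on each restart.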

Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,888,320
RAC: 0
Message 76357 - Posted: 17 Jan 2014, 14:43:35 UTC - in response to Message 76349.  

I see Danny has answered your question, so I'll just chime back in to say this doesn't cause a problem for rosetta@home; it just increases the computer cycles per workunit, causing a bit of inefficiency on your end. Eventually you will see a task error out when a model can't complete (after a hundred tries), but I doubt it will happen very often.

What else changed around the time you updated BOINC? Maybe I'm fixated on the three hour interval, but it seems most likely to be caused by Windows or some software other than BOINC. As I don't run Windows I don't know how you can see what's happening every three hours. If no one here has a suggestion I would post on the BOINC message boards where both BOINC and Windows gurus hang out and see if they don't have some useful ideas. You might want to use "Task ... exited with zero status but no 'finished' file" as the message title and be sure and give them the details of your troubleshooting efforts (Jord's checklist) in your first post.

Good Luck. Let us know what you discover.

Snags

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 76363 - Posted: 19 Jan 2014, 15:24:49 UTC

If a task finds itself restarting from the same point more than 5 times, it is ended for you. This might occur if you have a number of reboots to install operating system or application fixes, or have a long-running task on a day when your machine doesn't run very long before you power off, or if you do not keep tasks in memory and suspend a task repeatedly. But otherwise, this check is there to help ensure that if a task doesn't seem to be running properly on your machine, that it doesn't get hung up.

snagles mentioned 100 tries, I just wanted to reassure folks that it only takes 5. I can't recall exactly if the 5th try is the one that aborts, or if five restarts would actually be detected on a 6th start attempt. The idea was to let a task survive some of the normal PC activities of installing fixes etc. but to get it killed if it's not running properly.
Rosetta Moderator: Mod.Sense
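
As a rough illustration of the restart check described in the post above, here is a sketch of a counter that ends a task once it has restarted from the same point more than 5 times. This is not Rosetta's implementation; the class name, the method name and the exact boundary (the post itself is unsure whether the 5th or 6th attempt triggers it) are assumptions.

```python
MAX_RESTARTS_FROM_SAME_POINT = 5  # figure from the post; exact boundary is uncertain

_NEVER_STARTED = object()  # sentinel so that any real resume point differs from it


class RestartGuard:
    """Counts consecutive restarts from the same resume point (hypothetical helper)."""

    def __init__(self, limit=MAX_RESTARTS_FROM_SAME_POINT):
        self.limit = limit
        self.last_point = _NEVER_STARTED
        self.count = 0

    def on_start(self, resume_point):
        """Record a (re)start; return True if the task should be ended."""
        if resume_point == self.last_point:
            self.count += 1
        else:
            # Progress was made since the previous start, so the counter resets.
            self.last_point = resume_point
            self.count = 0
        return self.count > self.limit


guard = RestartGuard()
# One initial start followed by six restarts from the same point.
decisions = [guard.on_start("start of model 1") for _ in range(7)]
print(decisions)
```

With this reading of "more than 5 times", only the last of the seven starts is flagged; a few reboots for system updates stay well under the limit.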

SBF-GODS-STONE
Joined: 6 Nov 05
Posts: 15
Credit: 44,784
RAC: 0
Message 76781 - Posted: 29 May 2014, 22:07:26 UTC

Just a note to let you know I have found a small error in the processing of your tasks. I run with a 6 GB work disk for BOINC tasks, and I have noticed that when things slow down the cause tends to be a "Fragmented Work Disk". I have isolated it to these tasks rather than any additional projects. Every 5 hours I defrag this work disk and off it goes at full speed (and a 4.7 GHz 4-core CPU).

Thanks...

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 76782 - Posted: 30 May 2014, 4:07:21 UTC
Last modified: 30 May 2014, 4:08:23 UTC

Now you have me curious. How does one determine the cause of fragmentation? And (since it sounds like you are well-versed in such things) what might the application developers do differently to help things run better without manual defrag?

6GB doesn't sound like much when you have 4 CPUs. Would it behave better if the work disk were larger? It would seem as though all of the storage used for a given BOINC slots folder would be freed when the task is completed.
Rosetta Moderator: Mod.Sense

Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 76784 - Posted: 30 May 2014, 22:15:12 UTC - in response to Message 76781.  

Just a note to let you know I have found a small error in the processing of your tasks. I run with a 6 GB work disk for BOINC tasks, and I have noticed that when things slow down the cause tends to be a "Fragmented Work Disk". I have isolated it to these tasks rather than any additional projects. Every 5 hours I defrag this work disk and off it goes at full speed (and a 4.7 GHz 4-core CPU).

Thanks...


Out of curiosity, is your swap file on the same partition as your BOINC data directory?

You appear to be running Windows XP with 3GB of memory, so that is about 750MB per core (not including any memory reserved for system processes). As Rosetta tasks are heavily dependent on memory you may experience shortages which the system will try to alleviate by putting more pressure on the swap file. Frequent changes to your swap file may in turn be encouraging fragmentation in the rest of the partition.

Also, if you are hitting the limit of memory with 4 Rosetta tasks at the same time, that will be another reason for the slow down.

If other projects are working fine that lends weight to the memory theory, as Rosetta is the heaviest memory user of the BOINC projects I am familiar with.

I'd suggest cutting down to 2 or 3 cores for a day or two to see if there is any significant reduction in fragmentation.
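
The per-core memory figure above is simple arithmetic, sketched here for 4, 3 and 2 concurrent tasks. The helper name is hypothetical, and the even split is a simplification, since real tasks do not share memory equally and the operating system takes its own share.

```python
def memory_per_task_mb(total_ram_mb, running_tasks, reserved_for_system_mb=0):
    """Even split of the RAM left after the system's share, in MB per task."""
    return (total_ram_mb - reserved_for_system_mb) / running_tasks


total_mb = 3 * 1024  # 3 GB of RAM, as in the post
for tasks in (4, 3, 2):
    print(tasks, "tasks:", memory_per_task_mb(total_mb, tasks), "MB each")
```

With 4 tasks the split comes to 768 MB each before anything is reserved for the system, which matches the "about 750MB per core" estimate.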

SBF-GODS-STONE
Joined: 6 Nov 05
Posts: 15
Credit: 44,784
RAC: 0
Message 76786 - Posted: 31 May 2014, 6:48:25 UTC - in response to Message 76782.  

Now you have me curious. How does one determine the cause of fragmentation? And (since it sounds like you are well-versed in such things) what might the application developers do differently to help things run better without manual defrag?

WHAT IS THE REASON TO WRITE TO DISK AT A GIVEN TIME? CAN YOU SAY GET A MEMORY BUFFER FOR THIS PROCESS AND REPLACE THE WRITE TO DISK? OR DO YOU WRITE TO DISK FOR A SEPARATE TASK TO FIND, I.E. A TEMP WRITE? If you write to a buffer, the system will push it around in memory or send it to the swap file, and whatever function requires it, the system will return it to memory. WRITING TO DISK IS VERY TIME AND PROCESS-TIME HEAVY. My memory is very fast compared to disk writing and reading (this is true in all systems).

6GB doesn't sound like much when you have 4 CPUs. Would it behave better if the work disk were larger? It would seem as though all of the storage used for a given BOINC slots folder would be freed when the task is completed.


NO, THE BOINC DISK IS SEPARATE FROM THE SYSTEM SWAP DISK (CHANNELS AND DISKS)

SBF-GODS-STONE
Joined: 6 Nov 05
Posts: 15
Credit: 44,784
RAC: 0
Message 76788 - Posted: 1 Jun 2014, 18:03:01 UTC - in response to Message 76786.  

JUST FOR THE INFO SIDE, I HAVE ENLARGED THE BOINC DATA WORK DISK TO 10+ GB

Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 76789 - Posted: 1 Jun 2014, 22:33:18 UTC - in response to Message 76788.  
Last modified: 1 Jun 2014, 22:33:47 UTC

WHAT IS THE REASON TO WRITE TO DISK AT A GIVEN TIME? CAN YOU SAY GET A MEMORY BUFFER FOR THIS PROCESS AND REPLACE THE WRITE TO DISK?

NO, THE BOINC DISK IS SEPARATE FROM THE SYSTEM SWAP DISK (CHANNELS AND DISKS)


JUST FOR THE INFO SIDE, I HAVE ENLARGED THE BOINC DATA WORK DISK TO 10+ GB


You may not be aware, but writing entirely in capitals is considered to be the equivalent of shouting on most internet forums. As I am a volunteer like you I choose not to interact with shouting people most of the time as it usually just stresses me out for no benefit. However, I will assume that you aren't aware of the shouting issue.

I am not an expert but I believe Rosetta will write to disk for several reasons:

1. When data is downloaded ready to start the task.
2. When using the swap file.
3. When saving progress at a checkpoint so your work isn't lost in the case of a system failure or shutdown.
4. When the task is completed and ready to upload to the server.


You mentioned in one of your posts that you have fast memory; however, as I pointed out before, you don't have enough memory. Rosetta tasks frequently use hundreds of MB of memory per task, so if you have 4 tasks running at once you have no spare capacity. At best you will be using your swap file almost constantly. At worst Rosetta will be checkpointing and suspending a task while it waits for more memory to become available.

SBF-GODS-STONE
Joined: 6 Nov 05
Posts: 15
Credit: 44,784
RAC: 0
Message 76791 - Posted: 2 Jun 2014, 17:54:15 UTC - in response to Message 76789.  

No, I was not aware of the shouting. I do not blog unless I have to let someone know what is happening. Thanks much for the info.

Now, about the memory: the CPUs don't use the system memory for processing instructions. This CPU chip has 1 and 2 gigs of on-chip memory for L2 and L3 cache storage. All other activity is done with system resources. I am only assuming that the writing the developers are doing is for computation and temporary storage between tasks. If they used memory (if possible) for this type of operation, the tasks would run more effectively. The overhead of disk writing taxes the process very heavily in general. It may not be possible, but I thought I would mention it. I wrote system software for years and have a little idea of what happens in systems.

SBF-GODS-STONE
Joined: 6 Nov 05
Posts: 15
Credit: 44,784
RAC: 0
Message 76792 - Posted: 2 Jun 2014, 18:00:52 UTC - in response to Message 76782.  

Now you have me curious. How does one determine the cause of fragmentation? And (since it sounds like you are well-versed in such things) what might the application developers do differently to help things run better without manual defrag?

6GB doesn't sound like much when you have 4 CPUs. Would it behave better if the work disk were larger? It would seem as though all of the storage used for a given BOINC slots folder would be freed when the task is completed.


Thanks for the e-mail. The system, yours or mine, hung and somehow double-wrote that message; I was unaware.

I did raise the work disk (BOINC) to 10+ GB. The result so far is good; you may have pointed out that BOINC should be installed on bigger disks.

Have a good day.

SBF-GODS-STONE
Joined: 6 Nov 05
Posts: 15
Credit: 44,784
RAC: 0
Message 76793 - Posted: 2 Jun 2014, 18:07:43 UTC - in response to Message 76363.  

If a task finds itself restarting from the same point more than 5 times, it is ended for you. This might occur if you have a number of reboots to install operating system or application fixes, or have a long-running task on a day when your machine doesn't run very long before you power off, or if you do not keep tasks in memory and suspend a task repeatedly. But otherwise, this check is there to help ensure that if a task doesn't seem to be running properly on your machine, that it doesn't get hung up.

snagles mentioned 100 tries, I just wanted to reassure folks that it only takes 5. I can't recall exactly if the 5th try is the one that aborts, or if five restarts would actually be detected on a 6th start attempt. The idea was to let a task survive some of the normal PC activities of installing fixes etc. but to get it killed if it's not running properly.


Just saw this. I run this machine 24 hours a day, 7 days a week. I schedule work for two projects at a time because it runs way faster. Also, running four units at once causes no real problems with output.

Thanks.



©2024 University of Washington
https://www.bakerlab.org