Aborted work unit and memory usage

Questions and Answers : Windows : Aborted work unit and memory usage

idb
Joined: 17 Sep 05
Posts: 2
Credit: 100,029
RAC: 0
Message 115 - Posted: 17 Sep 2005, 21:36:46 UTC

Just a couple of observations...

I had to abort one of the work units I downloaded this morning. It was running OK, up to about 80% in around 5 hours, when I closed down BOINC (to check if it was causing a general slowdown problem I was seeing). When I restarted BOINC the rosetta w/u started up again but there must have been some error caused by the restart as it was running very slowly. It took 3 hours or so to do another 10% and then appeared to get stuck. I aborted it after 8 hours. The elapsed time/time to go were also reset to 0 after the BOINC restart, although the % completed showed the correct value?

Memory usage is a bit excessive! I've just started another w/u and it is currently using over 200MB. I now think the general sluggishness I was seeing on my system (see above) was probably caused by memory swapping.

BOINC 4.45, XP Home, 512 MB, P4 @ 3.2 GHz

Ian
ID: 115 · Rating: 0
Chris Marshall
Joined: 28 Sep 05
Posts: 1
Credit: 11,038
RAC: 0
Message 1285 - Posted: 12 Oct 2005, 11:45:36 UTC

I have also seen the memory issue on my system, a P4 2.8 GHz with HT enabled. Each WU is currently using 180 MB of virtual memory. Luckily I have 2 GB of RAM, so it is not affecting performance much, but that is still a lot of memory to be using.
ID: 1285 · Rating: 1
dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 2850 - Posted: 10 Nov 2005, 21:40:10 UTC - in response to Message 1285.  
Last modified: 10 Nov 2005, 21:41:48 UTC

Truth be told, 180 MB of virtual memory does not necessarily mean it's using that much physical memory.

If you go grab a copy of Process Explorer from Sysinternals it can provide some useful insight into memory usage.

Brief crash course on the subject. With virtual memory, there are two sizes you care about: the virtual size and the working set (WS) size. The virtual size (180 MB) is the total size of the process image, code and data, but not all of that is resident in physical memory; some of it will be in the swap file.

The working set size is the size of the portion that is actually resident in physical memory. That being said, Rosetta IS a little on the greedy side: typical WSS values range from 50 MB to 60 MB in my experience.

Process Explorer will show the virtual size and working set size if you pull down the View menu, choose "Select Columns" (last entry), switch to the Process Performance tab, check Virtual Size and Working Set Size, and click <OK>.
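
To put the figures from this post side by side, here is a minimal Python sketch that computes what fraction of the virtual size is actually resident. The helper name `resident_fraction` is made up for illustration, and the inputs are just the 180 MB and 50 to 60 MB numbers quoted above, not measurements.

```python
def resident_fraction(virtual_mb: float, working_set_mb: float) -> float:
    """Fraction of a process's virtual size that is resident in physical RAM."""
    if virtual_mb <= 0:
        raise ValueError("virtual size must be positive")
    return working_set_mb / virtual_mb


# Figures quoted in this thread: 180 MB virtual size, 50-60 MB working set.
for working_set in (50, 60):
    share = resident_fraction(180, working_set)
    print(f"{working_set} MB of 180 MB resident: {share:.0%}")
```

With those numbers, roughly a third or less of the virtual size would be occupying physical RAM at any moment.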

-- Edit -- Made the URL work --
ID: 2850 · Rating: 0
Vester
Joined: 2 Nov 05
Posts: 257
Credit: 3,259,514
RAC: 14,144
Message 3255 - Posted: 15 Nov 2005, 5:47:38 UTC
Last modified: 15 Nov 2005, 6:08:58 UTC

Alex Nichol's Virtual Memory in Windows XP is good.

I have two computers, one P4 and one AMD, running Windows XP, and each has 512 MB of RAM. Why run an outdated version of BOINC? Are you also running an optimized, older version of Rosetta? Running the latest client is important to the project, and I would expect that earlier revisions cannot handle more complex jobs.

I haven't been here long, but I see no reason to dump a job as long as it is running.


ID: 3255 · Rating: 0
Garry
Joined: 20 Nov 05
Posts: 3
Credit: 1,326,602
RAC: 0
Message 3773 - Posted: 20 Nov 2005, 22:09:38 UTC - in response to Message 1285.  

I saw a recommendation in the system requirements that Rosetta should be allowed to stay in memory when not running. And that the code does a checkpoint four times or so during each work unit.

I suspected that if I didn't let it stay in memory, it would lose all work since the last checkpoint. I tested.

The first time BOINC gave the processor to another experiment (and kicked Rosetta out of memory), Rosetta reported "Result 1n0u__abrelaxmode_random_length20_jitter02_omega_sim_aneal_bab100_12350_0 exited with zero status but no 'finished' file". CPU time and progress reported zero.

Is it reasonable to conclude that Rosetta didn't do a checkpoint during the time it ran on my machine? And that the time my machine contributed to Rosetta was lost?
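
One rough way to reason about this: work done since the last checkpoint is what gets lost when the application is removed from memory. The sketch below assumes checkpoints are evenly spaced through the work unit, which is only an assumption (the post says "four times or so"), and the function names are hypothetical.

```python
import math


def last_checkpoint(progress_pct: float, checkpoints: int = 4) -> float:
    """Progress (percent) at the most recent checkpoint, assuming even spacing."""
    interval = 100.0 / checkpoints
    return math.floor(progress_pct / interval) * interval


def work_lost(progress_pct: float, checkpoints: int = 4) -> float:
    """Percentage points of progress discarded if the app is unloaded now."""
    return progress_pct - last_checkpoint(progress_pct, checkpoints)


print(work_lost(20.0))  # unloaded before the first checkpoint: everything is lost
print(work_lost(80.0))  # last checkpoint at 75%: 5 points lost
```

Under that assumption, a work unit that restarts at zero progress would be consistent with it having been unloaded before its first checkpoint.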
ID: 3773 · Rating: 0
Frank
Joined: 29 Nov 05
Posts: 1
Credit: 2,256
RAC: 0
Message 5609 - Posted: 8 Dec 2005, 17:37:20 UTC - in response to Message 3255.  

What if it says it is "running", but is making no progress? When I first signed onto Boinc a couple of weeks ago, I got a Rosetta work unit which said "CPU time 15 sec - 8hr to completion - running - 0% completed" for several days. I dumped it and got a new one because I had read in another forum that this solved someone else's problems with work units hanging. This one ran for a while but has said "CPU time5 hrs to completion - running - 30% completed" for 4 days now.

In the "Messages" section, the only thing it says for Rosetta is

"Pausing result 1dcj__abrelax_rand_len10_jit02_omega_sim_filters_47230_0 (removed from memory)"

Fine - it's "pausing", but why? and how do I get it to start running? I don't see anything in previous messages to suggest an answer.

I have no problem running Predictor sets - several have been run and turned in - and memory is also not a problem.
ID: 5609 · Rating: 0
R/B
Joined: 8 Dec 05
Posts: 195
Credit: 28,095
RAC: 0
Message 5610 - Posted: 8 Dec 2005, 17:37:44 UTC

I got a client error here. The work unit was at 50% when BOINC switched over to another project, as is normal, and then the whole unit failed. I run BOINC 5.2.1 on Windows XP Home on an Athlon 3000+.


12/8/2005 12:26:43 PM|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_01515_0 ( - exit code -1073741819 (0xc0000005))


'Topology sample' ? Is this some kind of calibration unit I did half of because I'm new?

Curiously...credit is waiting to be granted under 'your account' when I click on it. But it says 'client error' next to it...

<--------CONFUSED.

Thanks for help in advance.
Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers.


ID: 5610 · Rating: 0
Rebirther
Joined: 17 Sep 05
Posts: 116
Credit: 41,315
RAC: 0
Message 5612 - Posted: 8 Dec 2005, 17:49:57 UTC - in response to Message 5610.  

You must enable "Leave application in memory" in your preferences to avoid this problem when BOINC switches between projects. Results that end in a client error will not be credited!
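
For reference, the two numbers in the log line from Message 5610 are the same value written two ways: the negative decimal is the signed 32-bit reading of the hex code. A small Python sketch of the conversion follows; the helper name and the one-entry lookup table are illustrative only, while 0xC0000005 itself is the standard Windows access violation status.

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex value."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# Illustrative lookup table containing only the code seen in this thread.
KNOWN_CODES = {"0xc0000005": "STATUS_ACCESS_VIOLATION"}

code = to_ntstatus(-1073741819)
print(code, KNOWN_CODES.get(code, "unknown"))  # 0xc0000005 STATUS_ACCESS_VIOLATION
```

An access violation means the process touched memory it was not allowed to use, so the failure was a crash in the application rather than a deliberate abort.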

ID: 5612 · Rating: 1
R/B
Joined: 8 Dec 05
Posts: 195
Credit: 28,095
RAC: 0
Message 5615 - Posted: 8 Dec 2005, 18:07:06 UTC
Last modified: 8 Dec 2005, 18:09:17 UTC

Thanks, I had it set to leave in memory but forgot to click 'Update' in my BOINC Manager. I haven't slept in a while. I run a few other projects but am new to Rosetta as of today. This Athlon seemed to be humming along, at 50% of that unit 45 minutes into it. So I just put Rosetta on my old 500 MHz machine right now... we'll see how fast my second, older machine can get them done.

It's trying to download 30 Rosetta units on this second machine, which is a 500 MHz box with 256 MB of RAM on dialup. Is this normal? It's set to a 3-day cache...
Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers.


ID: 5615 · Rating: 0
R/B
Joined: 8 Dec 05
Posts: 195
Credit: 28,095
RAC: 0
Message 5626 - Posted: 8 Dec 2005, 19:41:37 UTC

Ahhh, I think I get it. Rosetta sends multiple packets that comprise each individual work unit. So it looked like I was downloading 20 or 30 when I was really just downloading 4 or 5.
Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers.


ID: 5626 · Rating: 0
Gio
Joined: 5 Jan 06
Posts: 2
Credit: 18,139
RAC: 0
Message 8629 - Posted: 9 Jan 2006, 8:51:02 UTC

I get a lot of client errors, and I'm about to leave this project. I get about 10% success, and only if I let Rosetta run alone for hours. So I have to suspend all other projects.

I want to give this project one last chance.

As I read here, I need to set "Leave in memory".

I cannot find the option "Leave in memory". Would anyone be so kind as to tell me exactly where to set this option?

Thanks
Gio
ID: 8629 · Rating: 0
J D K
Joined: 23 Sep 05
Posts: 168
Credit: 101,266
RAC: 0
Message 8641 - Posted: 9 Jan 2006, 13:28:36 UTC - in response to Message 8629.  

Go to your account and look under preferences; you will find it there...
BOINC Wiki

ID: 8641 · Rating: 0
Gio
Joined: 5 Jan 06
Posts: 2
Credit: 18,139
RAC: 0
Message 8655 - Posted: 9 Jan 2006, 16:23:35 UTC - in response to Message 8641.  

Thanks, I found it. (For some reason I did not find it before.)

Hope now all is fixed.


ID: 8655 · Rating: 0
verdy_p
Joined: 8 Feb 06
Posts: 5
Credit: 8,785
RAC: 0
Message 10995 - Posted: 20 Feb 2006, 11:45:49 UTC - in response to Message 1285.  

You're right. It's completely ridiculous to permanently dedicate so much virtual memory space on disk while Rosetta is suspended, whether because we are running another application or because time is shared with other BOINC projects.

The main reason is a serious bug in Rosetta's interrupt handler, whose programming is definitely not thread-safe and corrupts the main computing thread.

I don't like leaving Rosetta in memory. It is against the philosophy of BOINC projects, which should use ONLY idle computing resources. Rosetta should withdraw from BOINC until this bug is corrected (the bug may even produce false scientific results through data corruption, even when a work unit does not terminate abruptly with an unrecoverable error).

Also: please save computing snapshots more often. When there's a failure, the work unit state should be recoverable without losing too much CPU time, and the work unit should then progress far enough to reach the next snapshot, so that a single failure caused by an external event is bypassed. Note that when BOINC is running as a screensaver, it may be interrupted very quickly, before any significant progress has been made. And such an event may occur several times in quick succession. This is not a failure but a common issue with screensavers, which are sometimes triggered when the user is just reading a document or has paused for a short time and does not want the screensaver to interrupt their work.

To solve this problem: Rosetta should sleep for a short time when it gets paused, and if it has not been resumed after 2 minutes, it should save its computing state to disk, shut down, and exit. If the computing thread holds a lock on a critical data section, the interrupt handler may not be able to terminate the job immediately; it should then start a timer and retry in a loop, after delays of 2 seconds up to 1 minute. It MUST make every effort to exit the Rosetta process and free memory as fast as possible.
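
The pause behaviour proposed above could be sketched roughly as follows in Python. This is only an illustration of the suggestion, not how Rosetta or BOINC actually works: the callables `resumed`, `try_checkpoint`, and `sleep` are hypothetical hooks, and doubling the retry delay from 2 seconds up to a 60 second cap is one possible reading of "2 seconds up to 1 minute".

```python
from typing import Callable, Iterator


def retry_delays(first: float = 2.0, cap: float = 60.0) -> Iterator[float]:
    """Yield retry delays that double from `first` seconds up to `cap` seconds."""
    delay = first
    while True:
        yield delay
        delay = min(delay * 2, cap)


def handle_pause(
    resumed: Callable[[], bool],
    try_checkpoint: Callable[[], bool],
    sleep: Callable[[float], None],
    grace: float = 120.0,
    poll: float = 1.0,
) -> str:
    """Wait up to `grace` seconds for a resume, then checkpoint and exit.

    `try_checkpoint` returns False while the compute thread still holds its
    lock; in that case the attempt is retried after an increasing delay.
    """
    waited = 0.0
    while waited < grace:
        if resumed():
            return "resumed"  # short pause: keep the job in memory
        sleep(poll)
        waited += poll
    for delay in retry_delays():
        if try_checkpoint():
            return "checkpointed"  # state saved: safe to exit and free memory
        sleep(delay)
    raise AssertionError("unreachable: retry_delays() never ends")


# Simulated run: never resumed, and the checkpoint succeeds on the 4th attempt.
sleeps: list[float] = []
attempts = iter([False, False, False, True])
outcome = handle_pause(lambda: False, lambda: next(attempts), sleeps.append)
print(outcome, sleeps[-3:])
```

Passing the clock and the hooks in as parameters keeps the sketch testable without real waiting; a real implementation would also need to decide what to do if the lock is never released.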

For now, Rosetta just stinks, and abuses users' computing resources.
ID: 10995 · Rating: -1
teepsy
Joined: 5 Jan 06
Posts: 2
Credit: 4,926
RAC: 0
Message 16135 - Posted: 13 May 2006, 3:49:38 UTC - in response to Message 10995.  
Last modified: 13 May 2006, 3:50:45 UTC

I've been running only Rosetta for a few days now, but something happened yesterday when it started aborting every unit I've been getting. Sometimes it aborts at 22 or so percent, but usually at about 77-78%. Any ideas? Am I the only one that has been having this problem the past two days?
ID: 16135 · Rating: -1
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 16140 - Posted: 13 May 2006, 5:07:49 UTC - in response to Message 16135.  

Teepsy,

Some of the errors are actually work units that the system thought were hung and so automatically aborted. This is called the "watchdog". Some of the others may be a known BOINC issue. If you have not done so already, you should upgrade your BOINC software to version 5.4.9. That has fixed a lot of problems for people. You can get the software here.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 16140 · Rating: 0
teepsy
Joined: 5 Jan 06
Posts: 2
Credit: 4,926
RAC: 0
Message 16191 - Posted: 13 May 2006, 19:02:53 UTC - in response to Message 16140.  
Last modified: 13 May 2006, 19:07:34 UTC

I've been running only Rosetta for a few days now....it aborts at 22 or so percent, but usually at about 77-78%.

Teepsy, ..."watchdog". Some of the others may be a known BOINC issue....upgrade your BOINC software to version 5.4.9.here.


Thanks for the 5.4.9 - it looks much better now! However, I got one that aborted at 23% and another that completely screwed up. Do you think I should just not do Rosetta for a while? I hate to be messing up their data!

Oh! When I go into my "results" page, it keeps telling me I have one pending, but I do not have anything working. I just noticed that some of my units are showing up as completed (and I have received credit), but others keep saying client error so, of course, I don't get credit.

I really don't know what to do; I feel bad about this!

ID: 16191 · Rating: 1
dr.frank
Joined: 10 Apr 06
Posts: 1
Credit: 23,691
RAC: 0
Message 19204 - Posted: 24 Jun 2006, 11:22:31 UTC

I'm sorry to have to agree with verdy. I'm running 4 different applications, and NONE of the others poses any problems.

Every time I go upstairs, where I have a server running, the screen is blocked by a Rosetta window and error message.

So if this keep-in-memory setting is the only way to solve this, Rosetta is out the door, even if it has a pretty screensaver...

Get your act together: you are getting a lot of compute time for free, and the LEAST you can do is make your software work!

Frank
ID: 19204 · Rating: -1
MG
Joined: 27 Nov 06
Posts: 1
Credit: 0
RAC: 0
Message 34267 - Posted: 7 Jan 2007, 9:40:27 UTC - in response to Message 10995.  

Still got the same problem with rosetta@home while all my other projects are running without any difficulties. There still seems to be no solution. So I'm out, too!
ID: 34267 · Rating: -1
Paul Massaria
Joined: 23 Nov 06
Posts: 1
Credit: 20,520
RAC: 0
Message 34310 - Posted: 7 Jan 2007, 18:42:41 UTC - in response to Message 34267.  

Is this the reason I get so many client errors on the results?

ID: 34310 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org