Odd wu/wu behaviour?

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 13657 - Posted: 13 Apr 2006, 15:53:42 UTC
Last modified: 13 Apr 2006, 16:49:43 UTC

17128664 seemed to be going wrong. It was still under 2% finished after more than an hour of crunching. It was moving, because I'd noticed it at 1.87% and later saw it had crawled to 1.9%, so not a 1% issue.

Thing is, next time I looked at it, ~5 minutes later, it was complete and ready to report. Note that despite a 7200 seconds preference, it ran for a lot less.

This is total guesswork, but is the short run time because the first "decoy" took more than half the allotted time slot? Still odd with the progress bar. My current wu seems to be doing the same thing. It's been running for 47 minutes but is only 1.2% complete.

The wu on my other machine has been running 1 hour 23 and showing 62% finished which is much more like it? Both are running 4.98, (4.83 of course).

*** EDIT ***

That wu is now pre-empted at 59:53 and is showing 1.26% complete.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13657 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13681 - Posted: 13 Apr 2006, 20:23:19 UTC
Last modified: 13 Apr 2006, 20:25:52 UTC

Have a look at my reply to some other fellow here

He asked:
Here's an odd one...
Rosetta 4.98, WU _largescale_large_fullatom_relax_dec7449_1_08_2.pdb_431_25_0 running with BOINC 5.2.13 on Windows XP 64-bit SP1 on an Athlon 64 3200+ with 512MB RAM. I also have SETI@home on that machine.

Starts up, 50% done, 2 hours CPU time used, runs for about an hour, at the end of that time it's still about 50% done, but has 3 hours CPU time; swaps out... SETI runs for an hour and swaps out... and then Rosetta swaps in again, 50% done, 2 hours (!) CPU time used. Caught this one because the accepted protein shape is pretty uncommon (looks sort of like a lollipop).

Shall I kill it or do you want me to keep watching it for a while? It's been on here for three days now, which means ballpark 36 hours, but I think I have only 2 hours credit for it...



These *_largescale_large_fullatom_relax* WUs are very big WUs which take a loooong time per model. On P4s they take 2-4 HOURS PER MODEL, so unless you have "Leave in mem when pre-empted"=YES, the PC can't complete even 1 model in 2hr before Rosetta gets swapped out to run SETI and your PC starts the WU from 0 again...

Solution: increase "time between swaps" to e.g. 4hr or IDEALLY (if your PC has enough RAM and/or runs few BOINC projects) set "leave in mem when preempted"=YES

I always choose leave in mem=YES.

This very example is why Rosetta needs a BigWU flag in preferences IMHO...

AMD_is_logical also explained it in a previous comment:

Another problem is that the bug requiring "keep in memory" has been fixed. That means a lot of people are setting "keep in memory" to "no". There are places in some WUs that require more than an hour to get to the next checkpoint, so with the default switching time of one hour the WU will keep dropping back to the last checkpoint indefinitely.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 13681 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 13682 - Posted: 13 Apr 2006, 20:30:14 UTC

I didn't know the "leave in memory" problem was still an issue here, but have been a member for a long time, and habitually set "leave in memory" to true anyway.

I've watched a couple of wu now, and there does seem to be a difference in the way my 2 machines here run. Foniks seems to have a working progress bar, Evesham does not. The current wu on Evesham has 1:59:47 time (pre-empted now), but shows 1.50% complete.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13682 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13683 - Posted: 13 Apr 2006, 20:43:01 UTC
Last modified: 13 Apr 2006, 20:43:53 UTC

The "leave in memory" is no longer an issue, but Rosetta "check-points" at the end of each model (notice the Model/Step info in the screensaver). AMD_is_logical explained it very well.

Most WUs are small proteins, which take only ~10min per model.

Some recent ones are very big and take 2-4hr per model (on Pentium4!). So to finish such a WU, Rosetta needs to run on your PC for 4hrs, WITHOUT being unloaded from memory. If a PC unloads Rosetta every hour to run another project, it will never finish, as it'll start every time from scratch.
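To make the "never finishes" case concrete, here is a minimal sketch in Python. It is not Rosetta or BOINC code: the function name and parameters are made up for illustration, and it assumes (as described in this thread) that the only checkpoint is at the end of a model, so being unloaded from memory throws away the partial model.

```python
def cpu_hours_to_finish(model_hours, switch_hours, leave_in_memory, max_swaps=100):
    """Return CPU hours spent before one model completes, or None if it never does.

    Hypothetical model: the app checkpoints only at the end of a model, so
    unloading it from memory at a project switch loses the partial model.
    """
    spent = 0.0   # total CPU time burned, including work that was later lost
    done = 0.0    # progress within the current model, in hours
    for _ in range(max_swaps):
        # Run until the next project switch or until the model is complete.
        run = min(switch_hours, model_hours - done)
        spent += run
        done += run
        if done >= model_hours:
            return spent
        if not leave_in_memory:
            done = 0.0  # pre-empted and unloaded: the model restarts from scratch
    return None


print(cpu_hours_to_finish(3, 1, leave_in_memory=False))  # never completes
print(cpu_hours_to_finish(3, 1, leave_in_memory=True))   # completes after 3 CPU hours
print(cpu_hours_to_finish(3, 4, leave_in_memory=False))  # longer switch interval also works
```

With a 3hr model and a 1hr switch interval, the only ways out in this sketch are keeping the app in memory or making the switch interval longer than one model.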

The surest way to run Big WUs would be to set "leave in memory when preempted"=YES.

IMHO this needs to be handled by the project, submitting big jobs only to PCs which run 24/7 and/or have leave-in-mem=yes.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 13683 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 13684 - Posted: 13 Apr 2006, 21:38:43 UTC

As I said above, "Leave in Memory" is set to true. This is not the issue.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13684 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13689 - Posted: 13 Apr 2006, 22:01:33 UTC - in response to Message 13683.  

IMHO this needs to be handled by the project, submitting big jobs only to PCs which run 24/7 and/or have leave-in-mem=yes.

Which will require a modification to BOINC to pass that information back to the Rosetta servers.

ID: 13689 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13690 - Posted: 13 Apr 2006, 22:12:54 UTC - in response to Message 13689.  
Last modified: 13 Apr 2006, 22:24:52 UTC

IMHO this needs to be handled by the project, submitting big jobs only to PCs which run 24/7 and/or have leave-in-mem=yes.

Which will require a modification to BOINC to pass that information back to the Rosetta servers.


The way BOINC works, global and per-project preferences are already stored on the project's (Rosetta's in this case) servers. (unless one uses a local .xml file override, which virtually nobody knows about).

Actually, you're correct in that our PC's BOINC client uses global BOINC settings from the most-recently updated profile of all the BOINC projects we run. So if I run Rosetta+SETI+Einstein and make a change in the global settings of e.g. Einstein, then Rosetta won't know it. I think the idea was to use some BOINC "account manager" to sync this info.
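As a rough illustration of that "most recently updated wins" rule, here is a small Python sketch. The dictionary layout, the field names and the timestamps are all made up for the example; this is not how the BOINC client actually stores or compares preferences internally.

```python
def effective_global_prefs(prefs_by_project):
    """Pick the copy of the global preferences with the latest modification time.

    prefs_by_project maps a project name to a dict holding a hypothetical
    'mod_time' (Unix timestamp) plus whatever settings that copy contains.
    """
    return max(prefs_by_project.values(), key=lambda p: p["mod_time"])


# Illustrative data only: the Einstein copy was edited last.
prefs = {
    "Rosetta":  {"mod_time": 1144900000, "leave_in_memory": False},
    "SETI":     {"mod_time": 1144800000, "leave_in_memory": False},
    "Einstein": {"mod_time": 1144950000, "leave_in_memory": True},
}

print(effective_global_prefs(prefs)["leave_in_memory"])
```

In this toy setup the client ends up with leave-in-memory switched on, while the copy held by Rosetta still says "no", which is the mismatch described above.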

I don't know if any BOINC project is sending WUs customized to client PC's profile-preferences (BigWU=yes) and/or capabilities (RAM>512, fast CPU, 24/7 operation etc).

A BigWU flag could be part of local-preferences (like the flags for WU-runtime and %-CPU-time-taken-by-screensaver), but I don't know if the BOINC server code supports customised feeding of WUs.

Seems like this needs to be coded in BOINC's scheduler/feeder https://boinc.bakerlab.org/rosetta/rah_status.php
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 13690 · Rating: 0
AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13700 - Posted: 14 Apr 2006, 0:08:26 UTC - in response to Message 13657.  

17128664 seemed to be going wrong. It was still under 2% finished after more than an hour of crunching. It was moving, because I'd noticed it at 1.87% and later saw it had crawled to 1.9%, so not a 1% issue.

Thing is, next time I looked at it, ~5 minutes later, it was complete and ready to report. Note that despite a 7200 seconds preference, it ran for a lot less.


The WU will start at 1%. Then there will be small increments as the WU goes through several stages of the first model (less than 1% total). After each model a larger jump in percentage is made based on how many models the rosetta app thinks it can do. If it doesn't think it can fit another model in without running too far over, the percent will jump directly to 100%.
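A rough sketch of that behaviour in Python, for anyone who wants to play with the numbers. This is not the Rosetta source: the function name is invented, every model is assumed to take the same time, and the "running too far over" rule is simplified to "another whole model would not fit in the target run time". It also ignores the small sub-1% increments inside the first model.

```python
def progress_after_each_model(model_seconds, target_seconds):
    """List the (cpu_seconds, percent) pairs reported after each finished model."""
    # Assumed estimate of how many models fit in the target run time (at least 1).
    planned = max(1, target_seconds // model_seconds)
    steps = []
    elapsed = 0
    for finished in range(1, planned + 1):
        elapsed += model_seconds
        # Assumed stopping rule: no room for another whole model, so report 100%.
        if elapsed + model_seconds > target_seconds:
            steps.append((elapsed, 100.0))
            break
        steps.append((elapsed, 100.0 * finished / planned))
    return steps


# A 3hr model with a 2hr (7200 s) target: one model, then straight to 100%.
print(progress_after_each_model(10800, 7200))
# A 2hr model with an 8hr target: four equal jumps.
print(progress_after_each_model(7200, 28800))
```

The first call shows why a big WU can sit below 2% for hours and then suddenly finish; the second shows the evenly spaced jumps you get when several models fit in the run time.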
ID: 13700 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 13706 - Posted: 14 Apr 2006, 3:43:47 UTC - in response to Message 13700.  
Last modified: 14 Apr 2006, 3:44:11 UTC

17128664 seemed to be going wrong. It was still under 2% finished after more than an hour of crunching. It was moving, because I'd noticed it at 1.87% and later saw it had crawled to 1.9%, so not a 1% issue.

Thing is, next time I looked at it, ~5 minutes later, it was complete and ready to report. Note that despite a 7200 seconds preference, it ran for a lot less.


The WU will start at 1%. Then there will be small increments as the WU goes through several stages of the first model (less than 1% total). After each model a larger jump in percentage is made based on how many models the rosetta app thinks it can do. If it doesn't think it can fit another model in without running too far over, the percent will jump directly to 100%.

That is the correct answer.

We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
ID: 13706 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13708 - Posted: 14 Apr 2006, 4:03:52 UTC - in response to Message 13683.  

The "leave in memory" is no longer an issue, but Rosetta "check-points" at the end of each model (notice the Model/Step info in the screensaver). AMD_is_logical explained it very well.

Most WUs are small proteins, which take only ~10min per model.

Some recent ones are very big and take 2-4hr per model (on Pentium4!). So to finish such a WU, Rosetta needs to run on your PC for 4hrs, WITHOUT being unloaded from memory. If a PC unloads Rosetta every hour to run another project, it will never finish, as it'll start every time from scratch.

The surest way to run Big WUs would be to set "leave in memory when preempted"=YES.

IMHO this needs to be handled by the project, submitting big jobs only to PCs which run 24/7 and/or have leave-in-mem=yes.


We'd like to be able to do this, but there is currently no mechanism in BOINC that allows it. During CASP, which is coming up soon, we may ask participants to set leave in memory = yes, as there are likely to be some larger proteins.
ID: 13708 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 13711 - Posted: 14 Apr 2006, 8:43:25 UTC

Having "noticed" this behaviour, I have taken to watching that machine more carefully. I can now report that all the wu's I've watched have done the same thing. The wu progresses very slowly up to around 2% then finishes. This is not the same as wu's on this machine.

Here, my current wu has 1:04:33 and 37.4% done, the box in question has 1:59:46 and 1.52% done. Both machines have work unit length set to 7200 seconds, as can be seen in the results.

The progress bar is working differently on that machine for some reason. Both are Intel P-IV systems, on similar ASUS MoBo's, both run BOINC 5.2.x, both are returning valid results after about the same amount of time. The machine showing this behaviour has considerably less memory in it than the other (256M v 1G), and is running NT4 rather than XP.

I don't particularly care, but it does seem odd, and "odd" behaviour is often symptomatic of a deeper problem.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13711 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 13714 - Posted: 14 Apr 2006, 9:57:48 UTC

Sorry, it won't let me edit!
The WU will start at 1%. Then there will be small increments as the WU goes through several stages of the first model (less than 1% total). After each model a larger jump in percentage is made based on how many models the rosetta app thinks it can do. If it doesn't think it can fit another model in without running too far over, the percent will jump directly to 100%.

If this was the case, would not the details of my results show that only 1 structure had been produced? This is not as observed in the results for that machine.

I will set the wu length up to 4 hours to see if the behaviour changes however.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13714 · Rating: 0
Knorr
Joined: 18 Feb 06
Posts: 21
Credit: 373,953
RAC: 0
Message 13716 - Posted: 14 Apr 2006, 11:20:53 UTC - in response to Message 13714.  


If this was the case, would not the details of my results show that only 1 structure had been produced? This is not as observed in the results for that machine.


Could you post a link to the results you are thinking about?

Because all the largescale_large_fullatom_relax WUs I've checked in your results actually have only 1 structure.

And as explained before:

The percentage will hardly reach 2% before the first model is completed. Then the percentage will jump according to your CPU run time pref.

If you have a largescale WU which would take about 3 hrs for one model and your run time is set at 2 hrs, then the WU is gonna complete that model.

Even if it's 1 hr above your settings.

And the progress percentage will start at 1%, increase in small steps towards 2% all the way up to 3 hrs CPU time, and then jump to 100% and completion.

- Knorr
ID: 13716 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 13720 - Posted: 14 Apr 2006, 14:22:18 UTC
Last modified: 14 Apr 2006, 14:24:19 UTC

This wu was crunched by Evesham (the node which is showing this odd behaviour), and looks to me as if it has done 6. This one was crunched by the machine which is not showing the errant behaviour and appears to have only done 1.

Right now, this machine has a wu in progress, at 2:16:48 it shows 49.32% complete (I have changed the target run time to 4 hours). The wu on Evesham is pre-empted at the moment at 0:57:07 showing 1.53% complete.

Evesham is a slower machine, but not by a vast amount, it is a 2.533GHz Northwood, whilst Foniks is a 3.2GHz Prescott. Evesham is only running BOINC whilst Foniks is running my production web server and database server, so is doing a little other work as well.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13720 · Rating: 0
Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 13721 - Posted: 14 Apr 2006, 14:25:37 UTC - in response to Message 13716.  



...
If you have a largescale WU which would take about 3 hrs for one model and your run time is set at 2 hrs, then the WU is gonna complete that model.

Even if it's 1 hr above your settings.

And the progress percentage will start at 1%, increase in small steps towards 2% all the way up to 3 hrs CPU time, and then jump to 100% and completion.

- Knorr


I have one of those giants right now, and after I read this thread, I set it to stay in memory, but I had to close down my computer, so now it's back to zero! Both in time and in progress, so the almost 2 hours it ran seems to be lost. :-(

Is there any possibility to let those giants write to disk more often by creating some checkpoints? Then it will only go back to the checkpoint and continue from there.


"I'm trying to maintain a shred of dignity in this world." - Me

ID: 13721 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13733 - Posted: 14 Apr 2006, 16:55:05 UTC - in response to Message 13720.  
Last modified: 14 Apr 2006, 16:57:33 UTC

Having "noticed" this behaviour, I have taken to watching that machine more carefully. I can now report that all the wu's I've watched have done the same thing. The wu progresses very slowly up to around 2% then finishes. This is not the same as wu's on this machine.

Here, my current wu has 1:04:33 and 37.4% done, the box in question has 1:59:46 and 1.52% done. Both machines have work unit length set to 7200 seconds, as can be seen in the results.


adrianxw, I think the "odd" %-progress behaviour you're seeing might be because the WUs running on your PC can be very different. It can be apples and oranges. A model on a HBLR* WU might take 10min and a *_largescale_large_fullatom_relax_* might take 3hr.

Rosetta is very different than most other BOINC projects, which have more or less constant size WUs.

WU %-progress might not increase linearly with time, as AMD_is_logical / Snake_Doc said. Especially if you're using very short WU runtime.

The *_largescale_large_fullatom_relax_* WUs are very big and sometimes "Steps" remains at 0. Usually just one "Model" will fit in the 7200 seconds (2hr) timeframe, in which case the %-progress indicator may stay at e.g. 1.5% for 1-3 hours while computing the first model and then finish, realise that it can't run a second model per your WU-runtime settings (7200 sec might have been already exceeded), so it jumps to 100% and finishes.

I use 8-hr WU-runtimes (instead of 2hr default) and a big WU taking 2hr per Model might jump 0% -> 25% -> 50% -> 75% -> 100% in BOINC progress

Hope this helps and I understood your questions correctly this time.
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 13733 · Rating: 0
Mikkie
Joined: 1 Apr 06
Posts: 9
Credit: 5,700
RAC: 0
Message 13738 - Posted: 14 Apr 2006, 17:27:01 UTC - in response to Message 13733.  

The *_largescale_large_fullatom_relax_* WUs are very big and sometimes "Steps" remains at 0. Usually just one "Model" will fit in the 7200 seconds (2hr) timeframe, in which case the %-progress indicator may stay at e.g. 1.5% for 1-3 hours while computing the first model and then finish, realise that it can't run a second model per your WU-runtime settings (7200 sec might have been already exceeded), so it jumps to 100% and finishes.

I use 8-hr WU-runtimes (instead of 2hr default) and a big WU taking 2hr per Model might jump 0% -> 25% -> 50% -> 75% -> 100% in BOINC progress.


Yeah right, ever thought about people who don't run 24/7 or aren't running power engines? There are still people over here who do it just for the fun. I had such a largescale wu but abandoned it because it was still busy crunching Model 1 after 9 hours at 1.4%. All I get at the moment are these things. They all get dumped.
ID: 13738 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 13752 - Posted: 14 Apr 2006, 19:07:17 UTC

I'd set my wu runtime down to 1 hour because of the problems with application 4.97. I have set it back to 4 hours, and not noticed any real change in behaviour, although I have not had it like this for long yet.

Certainly, the wu's I have running now should be with the new target time, and they are presenting VERY differently on screen as I look. Both are running largescale_large_fullatom wu's, on one machine it shows 01:11:44 and 1.58%, the other, 1:14:06 and 27.27%.

I do understand the way the system works. I am well used to non-linear progress bars. As I've said above, I don't really care, but it struck me that my 2 systems are behaving very differently, and I'd like to understand that so I am happy that there is not a hidden issue here.

I linked a couple of results further up, one crunched by the "weird" system which, to me at least, seems to show it ran several structures. Another from the machine that has a much more linear progress bar, which apparently only managed one. I'd appreciate someone who knows what those results show explaining it to me.

It is not the first time I've had a problem. I had to switch off Leiden@Home on Evesham because after the recent upgrade to the science application, my results were being judged invalid. This was due to a reporting difference between NT4 and XP.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13752 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13756 - Posted: 14 Apr 2006, 19:23:43 UTC - in response to Message 13752.  
Last modified: 14 Apr 2006, 20:22:59 UTC

Certainly, the wu's I have running now should be with the new target time, and they are presenting VERY differently on screen as I look. Both are running largescale_large_fullatom wu's, on one machine it shows 01:11:44 and 1.58%, the other, 1:14:06 and 27.27%.


On a dual-CPU PC (WinXP) in front of me I have two largescale_large_fullatom WUs running concurrently.

Both WUs have been running about the same time, ~1+hr; one shows progress 1.3021%, the other about 25%, very similar to what you see. But mine are on the same PC.

I think that the diff is in how/whether "steps" are incremented. In one case it stays at 0, in the other it increments as usual.

Bottom line: I wouldn't worry about it!
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 13756 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 42
Message 13809 - Posted: 15 Apr 2006, 7:39:36 UTC

I wasn't overly worried about it as an issue, I was simply afraid that there may be an underlying problem with Rosetta and NT4, (the latest Leiden@Home client does not work properly with NT4 for example).

This morning, however, I saw I had a wu on Evesham that was 01:08:01 and 28.43% complete. So I believe the "issue" to be a non-issue as explained.

Cheers folks.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 13809 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org