Problems and Technical Issues with Rosetta@home

Ed

Joined: 2 Aug 11
Posts: 31
Credit: 662,563
RAC: 0
Message 70898 - Posted: 4 Aug 2011, 13:33:40 UTC - in response to Message 70882.  
Last modified: 4 Aug 2011, 13:39:25 UTC

Guys, I just got my first Rosetta WU, so I am now officially part of the project. :-)

So far not a happy camper.

My first WU had been running for 12:40 elapsed CPU time. Remaining went from 5 hours when it started to 33+ hours and was still going up when I aborted it. It only reached 6.706% done, which is where it was at 4 hours elapsed.

When I aborted the WU and sent an update request I got a new Rosetta WU.


Help me set my own expectations. The time remaining on this one says 4:17, so I expect I can complete it in 8 hours or less.

Is that a reasonable expectation?

GenuineIntel
Pentium(R) Dual-Core CPU E5300 @ 2.60GHz
[Family 6 Model 23 Stepping 10]

Microsoft Windows 7
Home Premium x64 Edition, Service Pack 1, (06.01.7601.00

Rosetta is set to get 33% of my total CPU time. 66% to Seti. App change every 60 minutes, which seems to be working fine.
ID: 70898 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 70901 - Posted: 4 Aug 2011, 14:42:29 UTC - in response to Message 70898.  

Guys, I just got my first Rosetta WU, so I am now offically part of the project. :-)



So far not a happy camper.

My first WU had been runing for 12:40 Elapsed cpu time. Remaning went from 5 hours when it started to 33+ hours and was still going up when I aborted it. It only reached 6.706% done which is where it was at 4 hours Elapsed.

When I aborted the WU and sent an update request I got a new Rosetta WU.


Help me set my own expectations. The remaining on this one says 4:17. So I will expect I can complete this one in 8 hours or less.

Is that a reasonable expectation?

GenuineIntel
Pentium(R) Dual-Core CPU E5300 @ 2.60GHz
[Family 6 Model 23 Stepping 10]

Microsoft Windows 7
Home Premium x64 Edition, Service Pack 1, (06.01.7601.00

Rosetta is set to get 33% of my total CPU time. 66% to Seti. App change every 60 minutes, which seems to be working fine.



On rare occasions a task goes into a loop. BOINC Manager also monitors the program, and the task has code built in so that when the run time goes on too long it is supposed to "call" the end of the task and report back. But it sounds like something got out of place and it just looped.

What is your run time for R@H? 4, 8 or more hours?
The program will/should run until your time limit is up, then stop crunching and compile and upload the information. You can report the result manually, or it will be reported automatically at a later time.
ID: 70901 · Rating: 0 · rate: Rate + / Rate - Report as offensive
TPCBF

Joined: 29 Nov 10
Posts: 109
Credit: 4,627,580
RAC: 1,835
Message 70902 - Posted: 4 Aug 2011, 15:14:41 UTC - in response to Message 70901.  

What is your run time for R@H? 4,8 or longer hrs?
The program will/should run until your time limit is up and then stop crunching and compile and upload the information. The reporting you can do manually or it will do it itself automatically at a later time.
That is something that it is certainly not doing for me...

I had reported this "freezing" WU issue several weeks ago and there was no clear answer as to what causes this...

And I just checked on the WUs I recently got: about half of them show up with "compute error", and half of them seem to go through fine...

I will stop those hosts still set to run R@H from receiving new tasks and work only on WGC for now, until this whole mess hopefully settles. :-(
Just no point in wasting resources this way...

Ralf
ID: 70902 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile dcdc

Joined: 3 Nov 05
Posts: 1829
Credit: 114,376,131
RAC: 52,943
Message 70903 - Posted: 4 Aug 2011, 16:36:32 UTC - in response to Message 70898.  

Guys, I just got my first Rosetta WU, so I am now offically part of the project. :-)



So far not a happy camper.

My first WU had been runing for 12:40 Elapsed cpu time. Remaning went from 5 hours when it started to 33+ hours and was still going up when I aborted it. It only reached 6.706% done which is where it was at 4 hours Elapsed.

When I aborted the WU and sent an update request I got a new Rosetta WU.


Help me set my own expectations. The remaining on this one says 4:17. So I will expect I can complete this one in 8 hours or less.

Is that a reasonable expectation?

GenuineIntel
Pentium(R) Dual-Core CPU E5300 @ 2.60GHz
[Family 6 Model 23 Stepping 10]

Microsoft Windows 7
Home Premium x64 Edition, Service Pack 1, (06.01.7601.00

Rosetta is set to get 33% of my total CPU time. 66% to Seti. App change every 60 minutes, which seems to be working fine.

Hi Ed

The "time remaining" is difficult for BOINC to calculate for Rosetta, so it tends to steadily increase for a while and then suddenly drop, then repeat. If you let it run, BOINC will balance the projects to your preference, so no worries there.
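To see why a "remaining" figure can climb while a task makes no visible progress, here is a small sketch of a naive linear extrapolation. This is an assumption for illustration only, not BOINC's actual estimator (which also uses other information and will not reproduce the 33+ hours Ed saw); the function name is made up. The 6.706% figure comes from Ed's post.

```python
def linear_remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Naive estimate: assume the rest of the task runs at the average rate so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours * (1.0 - fraction_done) / fraction_done


# Progress frozen at 6.706% while the clock keeps running:
# the extrapolated time remaining grows with every hour that passes.
for elapsed in (4.0, 8.0, 12.0 + 40.0 / 60.0):
    remaining = linear_remaining_hours(elapsed, 0.06706)
    print(f"{elapsed:5.2f} h elapsed -> about {remaining:6.1f} h remaining")
```

When the percentage done jumps forward again (for example when a model finishes), the same arithmetic makes the estimate drop suddenly, which matches the rise-then-drop pattern described above.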
ID: 70903 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Ed

Joined: 2 Aug 11
Posts: 31
Credit: 662,563
RAC: 0
Message 70904 - Posted: 4 Aug 2011, 17:02:32 UTC - in response to Message 70901.  
Last modified: 4 Aug 2011, 17:05:49 UTC



on a rare occasion the tasks go into loop mode. also boinc mgr monitors the program and the task has a code built into that when the time goes to long it is suppose to "call" the end of task and report back. But it sounds like something got out of place and it just looped.

What is your run time for R@H? 4,8 or longer hrs?


The program will/should run until your time limit is up and then stop crunching and compile and upload the information. The reporting you can do manually or it will do it itself automatically at a later time.



I did not set a run time for R@H. I did not realize this was required.

I figured a unit will run to completion. NO?

Why would I have to set this? Where would I find this setting?
ID: 70904 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Brett Collins
Joined: 13 Feb 11
Posts: 2
Credit: 147,888
RAC: 0
Message 70905 - Posted: 4 Aug 2011, 18:04:26 UTC - in response to Message 70158.  

We are aware that we have had some issues with bad jobs on Rosetta@home recently. We try to ensure that these bad jobs don't slip through, but they occasionally do. When that happens, your efforts to alert us to these problems are extremely important and very much appreciated.


My average credit is down and checking completed units showed tens of units have failed to receive credits, over a considerable time period, due to errors. I will restart Rosetta after purging all current files and if that does not fix the waste of resources I will remove the project from my machines. Can you please explain why these problem files manifest themselves?
ID: 70905 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 70906 - Posted: 4 Aug 2011, 19:02:52 UTC - in response to Message 70904.  



on a rare occasion the tasks go into loop mode. also boinc mgr monitors the program and the task has a code built into that when the time goes to long it is suppose to "call" the end of task and report back. But it sounds like something got out of place and it just looped.

What is your run time for R@H? 4,8 or longer hrs?


The program will/should run until your time limit is up and then stop crunching and compile and upload the information. The reporting you can do manually or it will do it itself automatically at a later time.



I did not set a run time for R@H. I did not realize this was required.

I figured a unit will run to completion. NO?

Why would I have to set this? Where would I find this setting?


You can limit the time or extend it beyond the "default" setting of the task by going into your account.

To do this, look at the top of the page and click on Participants, go near the bottom and look for the word Preferences in bold print (left-hand side of the screen). Within that block is a line called "Resource share and graphics" (on the left side), with "Rosetta@home preferences" to the right of it. Click on the Rosetta preferences, then look for "Target CPU run time". Right below that is "edit", which you can click on. There you will see a drop-down box with options ranging from "not selected" to 1 day. Set yours to whatever you want. You can throttle the run time of the task down to 1 or 4 hours, or extend it out to 1 day of total run time.

You can also change your resource share here.

When you're done, click the "Update preferences" button (you will get a message that the changes won't take effect until you communicate with the project), then go to your BOINC Manager, click on the Projects tab, select Rosetta and click the Update button. BOINC Manager then has the latest info.
You can then exit that page and come back here to the discussion boards or whatever. All BOINC projects should have this feature on your account page.

Also have a look at the R@H computing preferences, which can be found via the same block of preference lines I mentioned earlier.

In BOINC Manager you can also play around with the settings for how many days of extra work you want to keep on your system, as well as the settings for processor, disk and memory usage. Just go to Tools, then Computing preferences, and change what you want in these categories.

All these things help customize BOINC and the projects to your specifications.
ID: 70906 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 70907 - Posted: 4 Aug 2011, 19:03:05 UTC
Last modified: 4 Aug 2011, 19:39:29 UTC

When was the last time we had a post from the project about these problems? ...maybe everyone took a vacation at the same time
ID: 70907 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Ed

Joined: 2 Aug 11
Posts: 31
Credit: 662,563
RAC: 0
Message 70909 - Posted: 4 Aug 2011, 21:56:07 UTC - in response to Message 70906.  


you can limit the time or extend the time beyond the "default" setting of the task by going into your account.



Thanks for the info. I now understand the what and the how but I still don't grasp the why.

If I let the Rosetta tasks run as much as they like, are you saying they will run forever?

If I limit them to 8 hours, won't that stop the process before it is done?

What is the impact of limiting the run time on the work I am doing for the project? Won't the WU be incomplete?
ID: 70909 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,797,168
RAC: 648
Message 70910 - Posted: 4 Aug 2011, 22:14:31 UTC - in response to Message 70904.  



on a rare occasion the tasks go into loop mode. also boinc mgr monitors the program and the task has a code built into that when the time goes to long it is suppose to "call" the end of task and report back. But it sounds like something got out of place and it just looped.

What is your run time for R@H? 4,8 or longer hrs?


The program will/should run until your time limit is up and then stop crunching and compile and upload the information. The reporting you can do manually or it will do it itself automatically at a later time.



I did not set a run time for R@H. I did not realize this was required.

I figured a unit will run to completion. NO?

Why would I have to set this? Where would I find this setting?


There is a default of three hours. There is no requirement that you set the run time; most crunchers probably don't even know the option exists. This setting is unrelated to the problem you experienced.

Examining the task details page for the work unit you aborted, you can see that it only ran for 1207.12 CPU seconds. This is so much less than the elapsed time you report seeing in the manager that I believe the work unit was stuck, meaning that, unknown to BOINC, the application had stopped running. It could be that the application experienced a fatal error but failed to tell BOINC. Or maybe it told BOINC but BOINC failed to respond. Maybe one, the other, both or something altogether different was triggered by some other process on your machine. The point is that it's a problem not exclusive to Rosetta, and it has proven difficult to resolve, partly because it's rather rare. It's terribly unfortunate that your first work unit was slain by this demon.

If you see the same behavior on your next work unit, check with your task manager to see if Rosetta (not just BOINC) is still using the CPU. If not, quit BOINC completely (don't just close the window) and restart. The work unit will either pick up from its last checkpoint or error out immediately. Please report back with the details.
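The "is it still using the CPU?" check can be written down as a rule of thumb. The sketch below is only an illustration: the function name and both thresholds are made up, and you would read the (elapsed, CPU) figures off the task's Properties page by hand. A task looks stuck when elapsed time keeps growing but CPU time does not.

```python
from typing import Sequence, Tuple


def looks_stuck(
    samples: Sequence[Tuple[float, float]],
    min_wall_gap: float = 600.0,  # hypothetical: require at least 10 minutes between readings
    min_cpu_gain: float = 1.0,    # hypothetical: a healthy task gains at least 1 CPU second
) -> bool:
    """samples holds (elapsed_seconds, cpu_seconds) readings, oldest first."""
    if len(samples) < 2:
        return False  # a single reading cannot show a trend
    (wall_first, cpu_first), (wall_last, cpu_last) = samples[0], samples[-1]
    waited_long_enough = (wall_last - wall_first) >= min_wall_gap
    cpu_barely_moved = (cpu_last - cpu_first) < min_cpu_gain
    return waited_long_enough and cpu_barely_moved


# CPU time frozen at 1207.12 s (the figure from the aborted task) for an hour.
print(looks_stuck([(41_000.0, 1207.12), (44_600.0, 1207.12)]))
# A task that gained most of an hour of CPU time over an hour of elapsed time.
print(looks_stuck([(0.0, 100.0), (3_600.0, 3_500.0)]))
```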


FYI:

The project (Rosetta, SETI) application does not create the time displays presented in the BOINC Manager; rather, the BOINC Manager estimates the time to completion from information it receives from the application. Your first work unit would have come to you with an estimated run time of 3 CPU hours. That's CPU hours, not wall-clock time. The "to completion" and "elapsed" figures shown in the BOINC Manager are both wall-clock time. If you highlight a task and click on "Properties" you'll see both CPU and elapsed time. On Rosetta you can also see the CPU time in the graphics window, provided the task is actually running.

The watchdog Greg mentioned would have kicked in after 7 hours if the work unit had been running, but if I'm correct it wasn't receiving CPU time, so the app wasn't running and its built-in watchdog wasn't running either.


Best,
Snags
ID: 70910 · Rating: 0 · rate: Rate + / Rate - Report as offensive
TPCBF

Joined: 29 Nov 10
Posts: 109
Credit: 4,627,580
RAC: 1,835
Message 70911 - Posted: 4 Aug 2011, 23:56:47 UTC - in response to Message 70904.  

I did not set a run time for R@H. I did not realize this was required.
It isn't required...
I figured a unit will run to completion. NO?
Unless there's apparently a bad batch of WU's, it will...

Ralf
ID: 70911 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,797,168
RAC: 648
Message 70912 - Posted: 5 Aug 2011, 0:49:54 UTC - in response to Message 70909.  


you can limit the time or extend the time beyond the "default" setting of the task by going into your account.



Thanks for the info. I now understand the what and the how but I still don't grasp the why.

If I let the Rosetta tasks run as much as they like, are you saying they will run forever?

If I limit them to 8 hours, won't that stop the process before it is done?

What is the impact of limiting the run time in relation to the work I am doing for the project. Won't the WU be incomplete?


You posted this while I was still composing my earlier post. I keep saying I need to learn to write faster but that means thinking faster and whatever benefits come with growing older I think I've reached the age where thinking faster is not going to be one of them. On to your questions:

First, no they won't run forever (except rarely, when something is very badly broken) and no matter the preferred run time models will be completed and useful information sent back to the project (except in those rare cases that combine a slow pc, a low run time and a particularly challenging work unit).

Rosetta completes as many models as it can within the preferred run time specified. Note the word preferred. It's not a hard and fast time limit; it won't stop computing in the middle of a model. Rather, the clever folks here have written application code that checks how long each model is taking and won't start a new model if it would increase the total run time beyond the preferred limit. They do, however, want at least one complete model from every work unit, and so will keep going past the preferred run time if necessary. If the model is still not complete after an extra four hours, the watchdog kicks in to stop the crunching. It assumes that something has gone wrong and doesn't want to waste any more of your time with that particular work unit. This means that most of your work units will end just under the preferred run time, with occasional exceptions running over.

To add a little more detail, the models don't take the exact same amount of time to complete, even models within the same work unit, so when the app thinks it has time to run another model, sometimes it's just, um, wrong. And sometimes a model will run much, much longer than the others. Sometimes this is unexpected and puzzling (see the old "long-running model" threads) and sometimes it's hoped for and interesting (protein-protein interface tasks).

There have also been some work units that ended after completing a single model, and others that ended after completing 100 models even if that meant ending them well short of the time limit. I believe the 100 model limit was imposed to deal with the very large files being created by those particular types of work units. I haven't seen anything from the project about the short single-model work units, so I can't say if that is intentional or in error. Even though this mix of designs can look quite confusing on our end, I remain confident that the team is getting the maximum amount of usable results while using our resources as efficiently as possible.
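For anyone who prefers to see the rule as code, here is a toy simulation of the behavior described above. It is a sketch under stated assumptions, not the actual Rosetta application logic: the function name is made up, and guessing the next model's cost from the average of the models already finished is my own stand-in for however the real app makes that call. The 3 hour default and the 4 hour watchdog margin are the figures from this thread.

```python
def simulate_work_unit(model_cpu_hours, preferred_hours=3.0, watchdog_extra_hours=4.0):
    """Return (models_completed, cpu_hours_used, reason_for_ending).

    model_cpu_hours lists the CPU hours each successive model would really need.
    """
    used = 0.0
    finished = []  # CPU hours of the models completed so far
    watchdog_limit = preferred_hours + watchdog_extra_hours
    for needed in model_cpu_hours:
        if finished:
            # Assumption: predict the next model from the average so far,
            # and do not start it if it would overshoot the preferred time.
            predicted = sum(finished) / len(finished)
            if used + predicted > preferred_hours:
                return len(finished), used, "preferred run time"
        # The first model is always started; any model may overrun the
        # preference, but the watchdog stops it at preferred + extra hours.
        if used + needed > watchdog_limit:
            return len(finished), watchdog_limit, "watchdog"
        used += needed
        finished.append(needed)
    return len(finished), used, "out of models"


print(simulate_work_unit([0.7] * 10))       # steady models: stops before overshooting
print(simulate_work_unit([0.5, 5.0, 0.5]))  # one long model pushes past the preference
print(simulate_work_unit([10.0]))           # a runaway model is ended by the watchdog
```

The second case shows why a work unit can occasionally run well past the preferred time: the app could not know in advance that the model it had just started would be a long one.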

I hope that reassures you that it will be perfectly fine if you never fiddle with the preferred run time preference.

Why might you want to? Well, think of it this way: at the default setting, for every 3 hours of CPU time spent crunching, BOINC has to contact the project, download and start up the new task, then zip up and upload the finished one.
By increasing the preferred run time you increase the time spent crunching models for the same amount (or very nearly the same amount) of admin. Folks with longer run times often continue crunching Rosetta when downtime on the project's servers has other people complaining of no work. You can in effect increase your cache without actually increasing your cache, which is particularly useful if you have flaky internet service. And if you are churning through many work units every day, you can decrease this (and the load on the project servers) by increasing the run time per work unit. A couple of caveats: the longer the preferred run time, the longer before you or the watchdog will know if there is a problem, and any run times set at the extremes will result in less accurate time-to-completion estimates by BOINC. (And for any of you folks who are churning through lots of work units and want to try changing your preferred run time, please be advised: a sudden dramatic change can lead to BOINC putting the work units into high priority mode and possibly to missed deadlines. Make the change incrementally, being sure to allow a couple of work units to complete and report between each increase. Obviously, the larger the cache, the more cautious you should be.)
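To put rough numbers on the admin overhead, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that a core crunches Rosetta around the clock and that every work unit uses exactly its preferred run time; the function name is made up.

```python
def work_units_per_day(preferred_run_hours: float, cores: int = 1) -> float:
    """Download/upload/report cycles per day under the assumptions above."""
    return cores * 24.0 / preferred_run_hours


for hours in (3, 6, 12, 24):
    cycles = work_units_per_day(hours)
    print(f"{hours:2d} h preferred run time -> {cycles:.0f} work units per core per day")
```

Each work unit means one round of download, start-up, zip and upload, so halving the count halves that overhead for roughly the same crunching.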



Best,
Snags

ID: 70912 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Ed

Joined: 2 Aug 11
Posts: 31
Credit: 662,563
RAC: 0
Message 70913 - Posted: 5 Aug 2011, 2:31:33 UTC

I want to thank everyone for your patient answers to my questions. I always want to understand the why as much as the what and how.

The work unit I have now seems to be running in a more normal fashion.

Started at 4:17

Elapsed is now 3:54 and 57.6% done. 2:40 to go.

so I will set my preference to 6 hours and most WU should be able to run to completion. Better to get more work out of fewer work units.

Thanks everyone.

Ed
ID: 70913 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 70914 - Posted: 5 Aug 2011, 3:08:35 UTC

thanks Snags for the more detailed and accurate explanation.
I knew it was something like that...

greg
ID: 70914 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Sid Celery

Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 70915 - Posted: 5 Aug 2011, 3:45:15 UTC - in response to Message 70907.  

When was then last time we had a post from the project about these problems? ...maybe everyone took a vacation at the same time

Nothing since we were told there'd be no work until early next week.

The key paragraph said:
The running of the jobs on R@h is only one step in the process - it takes a while to figure out what sorts of jobs will give usable scientific results, to set up the jobs, test them to make sure they won't cause a huge failure rate, and then at the end of the runs to process the results to figure out what the next round should do. Usually we have enough things going on that the computational lull in one project will be covered by the compute phase of a different one. We just happen to have hit a point where none of the currently active projects is in an active compute phase.


I really don't know what else people need to be updated with in the meantime. Just read the above again, I guess. There's plenty of work out there if people have a localised shortfall, but seeing as this is the first occasion I've seen an issue of this particular kind in 3.5 years I'm not personally bothered.
ID: 70915 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Ed

Joined: 2 Aug 11
Posts: 31
Credit: 662,563
RAC: 0
Message 70919 - Posted: 5 Aug 2011, 12:09:08 UTC - in response to Message 70915.  

When was then last time we had a post from the project about these problems? ...maybe everyone took a vacation at the same time

Nothing since we were told there'd be no work until early next week.

So, if everyone let the WU run longer, more work would be done with fewer WU and we would not run out of Rosetta tasks as often. This would make the best use of the admin's time, the server resources and probably give the scientists more bang for our efforts.

It would seem to me that people would want to set their run times longer than 3 hours, making every WU really count.
ID: 70919 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,797,168
RAC: 648
Message 70922 - Posted: 5 Aug 2011, 13:42:34 UTC - in response to Message 70913.  

I want to thank eveyone for your patient answers to my questions. I always want to understand the why as much as the what and how.

The work unit I have not seems to be running in a more normal fashion.

Started at 4:17

Elapsed is now 3:54 and 57.6% done. 2:40 to go.



so I will set my preference to 6 hours and most WU should be able to run to completion. Better to get more work out of fewer work units.

Thanks everyone.

Ed



These numbers look reasonable to me. Remember that the elapsed time and time to completion seen in your manager are in wall-clock time. The preferred run time preference is for CPU time. BOINC is designed to run on your computer at the lowest priority and will get out of the way, i.e. stop using the CPU (and thus stop accumulating CPU time), whenever the computer says it needs a bit more of the CPU. This may only be for a few fractions of a second, and the work unit will still show as running in the BOINC Manager and the elapsed time will continue to increase. Even if you set the BOINC preference "Use at most x% of CPU time" to 100, at the completion of any work unit the CPU time used will be at least fractionally smaller than the elapsed time. (I have set that preference to 100 and have a 12 hour preferred run time; elapsed time can be anywhere from 5 minutes to 4+ hours longer.) If you have a 3 hour preferred run time, the application will use about 3 hours of CPU time regardless of how many hours of elapsed time that might take.

In the task tab of your manager highlight the rosetta task then click on the "Properties" link (in the Commands list on the left) and you will see both cpu and wallclock (elapsed) time listed.
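If the difference between CPU time and wall-clock time is still fuzzy, this small sketch shows it on any machine with Python. It has nothing to do with BOINC itself; the function name is made up. It does a bit of real computing, then sits idle, and measures both clocks using the standard library's `time.perf_counter` (wall clock) and `time.process_time` (CPU time of this process).

```python
import time


def measure(work_iterations: int, idle_seconds: float):
    """Return (wall_seconds, cpu_seconds) for some computing followed by idling."""
    wall_start = time.perf_counter()  # wall-clock timer
    cpu_start = time.process_time()   # CPU time used by this process only
    total = 0
    for i in range(work_iterations):  # this part accumulates CPU time
        total += i * i
    time.sleep(idle_seconds)          # this part only accumulates wall-clock time
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


wall, cpu = measure(200_000, 0.2)
print(f"wall clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

The idle period counts toward elapsed time but not CPU time, which is the same reason a work unit's elapsed time in the manager runs ahead of its CPU time.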

Best,
Snags
ID: 70922 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,797,168
RAC: 648
Message 70924 - Posted: 5 Aug 2011, 15:19:59 UTC - in response to Message 70905.  

We are aware that we have had some issues with bad jobs on Rosetta@home recently. We try to ensure that these bad jobs don't slip through, but they occasionally do. When that happens, your efforts to alert us to these problems are extremely important and very much appreciated.


My average credit is down and checking completed units showed tens of units have failed to receive credits, over a considerable time period, due to errors. I will restart Rosetta after purging all current files and if that does not fix the waste of resources I will remove the project from my machines. Can you please explain why these problem files manifest themselves?


This is difficult to answer. Your task page shows lots of download errors and client errors that are, in effect, download errors ("couldn't start Input file minirosetta_database_rev42272.zip missing or invalid"). These files were subsequently sent to other crunchers who completed them successfully which suggests that the problem is not with the files themselves but something happens to them after they leave the rosetta server.

Resetting the project, or cleaning out even more completely in the manner Greg has suggested, may well do the trick, removing some corruption and setting you on your merry way again.

Or: The messages you received are also compatible with a virus scanner or firewall blocking the files. It's not unheard of for a virus scanner to snag the files of one project while letting the files of another project run along unmolested. Or to suddenly start having issues with a project's apps when previously they were deemed safe.

If Greg's clean up prescription doesn't help and you are certain your firewall isn't blocking rosetta files and your security scanners aren't holding them hostage then hopefully someone (familiar with Windows Vista) has some troubleshooting steps that can locate the source of the problem.
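One more thing that can be checked by hand is whether a downloaded archive is actually intact. The sketch below uses only Python's standard `zipfile` module; the function name is made up, and where the file lives inside your BOINC data directory is an assumption you would have to fill in yourself. It says nothing about what blocked or damaged the file, only whether the copy on disk is a readable zip.

```python
import zipfile


def check_zip(path: str) -> str:
    """Report whether the file at path is a readable, uncorrupted zip archive."""
    if not zipfile.is_zipfile(path):
        # Also covers a file that is absent or unreadable.
        return "missing or not a zip archive"
    try:
        with zipfile.ZipFile(path) as archive:
            bad_member = archive.testzip()  # name of first corrupt member, or None
    except zipfile.BadZipFile:
        return "damaged zip archive"
    if bad_member is not None:
        return f"corrupt member: {bad_member}"
    return "ok"


# File name taken from the error message; prefix it with your own project folder.
print(check_zip("minirosetta_database_rev42272.zip"))
```

If the file reports as missing or damaged right after a download, that points toward something on the machine (scanner, firewall, disk) rather than the project's copy.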


Best,
Snags
ID: 70924 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 70925 - Posted: 5 Aug 2011, 18:56:36 UTC - in response to Message 70913.  

I want to thank eveyone for your patient answers to my questions. I always want to understand the why as much as the what and how.

The work unit I have not seems to be running in a more normal fashion.

Started at 4:17

Elapsed is now 3:54 and 57.6% done. 2:40 to go.

so I will set my preference to 6 hours and most WU should be able to run to completion. Better to get more work out of fewer work units.

Thanks everyone.

Ed

Did you set the "While processor usage is less than X %" to 0%?
ID: 70925 · Rating: 0 · rate: Rate + / Rate - Report as offensive

©2024 University of Washington
https://www.bakerlab.org