Posts by Charles Dennett

1) Message boards : Number crunching : Large proteins (Message 95060)
Posted 21 Apr 2020 by Charles Dennett
Post:
Have the larger ones started coming out? This beast ran for 18 hours and only had one decoy: https://boinc.bakerlab.org/rosetta/result.php?resultid=1156094562


Same here for me: 1 decoy in 14 hours.


I believe both of those were ended by the watchdog after going 10 hours (36000 seconds) over your selected run times:

BOINC:: CPU time: 65043.4s, 36000s + 28800s[2020- 4-21 8:57: 3:] :: BOINC
Output exists: default.out.gz Size: WARNING! cannot get file size for default.out.gz: could not open file.
-1

BOINC:: CPU time: 50670.5s, 36000s + 14400s[2020- 4-21 10:22:32:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
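
The arithmetic behind those two log lines can be checked with a short sketch. It assumes that the two figures after the CPU time are the 10-hour watchdog allowance (36000 s) and the selected run time; the function name is made up for illustration.

```python
def watchdog_overrun(cpu_seconds, allowance_seconds, target_seconds):
    """Seconds by which CPU time exceeds allowance + target (negative if under)."""
    return cpu_seconds - (allowance_seconds + target_seconds)


# Figures copied from the two log lines above.
first = watchdog_overrun(65043.4, 36000, 28800)   # 8-hour selected run time
second = watchdog_overrun(50670.5, 36000, 14400)  # 4-hour selected run time

print(round(first, 1), round(second, 1))
```

Both tasks come out a few minutes past the combined limit, which fits the watchdog having ended them.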

-Charlie
2) Message boards : Number crunching : Large proteins (Message 94898)
Posted 19 Apr 2020 by Charles Dennett
Post:
Will there be a way to identify these long running tasks based on their name?
3) Message boards : Number crunching : Discussion on increasing the default run time (Message 94692)
Posted 17 Apr 2020 by Charles Dennett
Post:
There is apparently an issue with some (not all) tasks whose names start with 12v1n. I had one go over 36 hours before I finally aborted it. Others have reported similar issues. However, not all tasks that start with 12v1n have the issue. Just abort the task if it runs past the point where the watchdog should have aborted it.

-Charlie
4) Message boards : Number crunching : Rosetta 4.1+ and 4.2+ (Message 94691)
Posted 17 Apr 2020 by Charles Dennett
Post:
Yeah, I had a similar one whose name started the same way as yours. It ran over a day and a half before I aborted it. Apparently others have reported issues with tasks with names like that. Just abort it.

-Charlie
5) Message boards : Number crunching : Watchdog not working too well (Message 94669)
Posted 17 Apr 2020 by Charles Dennett
Post:
After over a day and a half of CPU time, I've aborted the task.

-Charlie
6) Message boards : Number crunching : Watchdog not working too well (Message 94626)
Posted 16 Apr 2020 by Charles Dennett
Post:
I'm going to let it run for a while just to see what happens.


You are much more curious than I :) I would blast it. But either way, please post an update when it reports back. I am curious too. Perhaps your dog (your profile photo) can help teach the R@h watchdog.

Have you verified the venue of the host as compared to the runtime preference for that venue? Have you been modifying the runtime preferences recently?


I don't use a specific venue. A couple of weeks ago I raised the run time from 1 hour to 8 hours and ran that way for a while. I noticed my RAC started dropping, so several days ago I lowered it to 2 hours to see if by any chance it would make a difference (not that I expect it to). The task was received well after I did that and was preceded by a lot of tasks that ran successfully with the 2-hour CPU time. So, I doubt that would have been the reason. Still, even with an 8-hour run time the watchdog would have aborted it after 12 hours.

Unfortunately, the dog in my profile is no longer with us. I'll have to get a picture of my new yellow lab. I ran R@H for a long time but stopped all distributed computing several years ago. I retired a year ago, and with the recent pandemic I jumped back in to do my part. R@H was always one of my favorites. Right now it's all I'm doing across 3 systems/12 cores.

-Charlie
7) Message boards : Number crunching : Watchdog not working too well (Message 94622)
Posted 16 Apr 2020 by Charles Dennett
Post:
I have a task that is on a 2-hour run time target. The watchdog should have stopped it at 6 CPU hours. Currently it is over 21 hours of CPU time:

Application: Rosetta 4.15
Name: 12v1n_al_12mer_design_00240_010210_0001_SAVE_ALL_OUT_914331_72
State: Running
Received: Wed 15 Apr 2020 12:05:07 PM EDT
Report deadline: Sat 18 Apr 2020 12:05:06 PM EDT
Estimated computation size: 80,000 GFLOPs
CPU time: 21:22:39
CPU time since checkpoint: 21:22:39
Elapsed time: 21:39:20
Estimated time remaining: 00:10:07
Fraction done: 99.226%
Virtual memory size: 382.00 MB
Working set size: 304.89 MB
Directory: slots/3
Process ID: 164535
Progress rate: 4.680% per hour
Executable: rosetta_4.15_x86_64-pc-linux-gnu
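
As a quick check of how far past the expected cutoff this is, here is a small sketch. It takes the 6-hour cutoff stated above as given (the 2-hour target plus 4 hours) and the CPU time from the task properties; the helper name is hypothetical.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


cpu_seconds = hms_to_seconds("21:22:39")   # CPU time from the task properties
cutoff_seconds = 6 * 3600                  # cutoff assumed in this post
overrun_hours = (cpu_seconds - cutoff_seconds) / 3600

print(cpu_seconds, round(overrun_hours, 2))
```

By that measure the task is more than 15 hours past the point where the watchdog was expected to step in.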

Also note that it has not checkpointed yet either. Looking at the files in the slots/3 directory does show some current activity (the current time at my location is 13:34 as I type this):

ls -lart | tail
-rw-r--r--. 1 boinc boinc 0 Apr 15 15:50 rosetta_tmp.txt
-rw-r--r--. 1 boinc boinc 0 Apr 15 15:50 minirosetta_database.zip.is_extracted
-rw-rw-r--. 1 charlie charlie 0 Apr 16 06:57 stderrgfx.txt
-rw-rw-r--. 1 charlie charlie 14 Apr 16 06:57 gfx_info
-rw-r--r--. 1 boinc boinc 6175 Apr 16 11:28 init_data.xml
drwxrwx--x. 3 boinc boinc 20480 Apr 16 11:28 .
-rw-r--r--. 1 boinc boinc 9529 Apr 16 13:30 12v1n_al_12mer_design_00240_010210_0001_check.txt
-rw-r--r--. 1 boinc boinc 3589 Apr 16 13:30 rng.state.gz
-rw-rw----. 1 boinc boinc 25001680 Apr 16 13:33 boinc_rosetta_3
-rw-r--r--. 1 boinc boinc 8192 Apr 16 13:33 boinc_mmap_file

A tail of the 12v1n_al_12mer_design_00240_010210_0001_check.txt file shows this:

tail 12v1n_al_12mer_design_00240_010210_0001_check.txt
LAST 497 SUCCESS 0
LAST 498 SUCCESS 0
LAST 499 SUCCESS 0
LAST 500 SUCCESS 0
LAST 501 SUCCESS 0
LAST 502 SUCCESS 0
LAST 503 SUCCESS 0
LAST 504 SUCCESS 0
LAST 505 SUCCESS 0
LAST 506 SUCCESS 0

Here's a link to the task:
https://boinc.bakerlab.org/rosetta/result.php?resultid=1150908452

I'm going to let it run for a while just to see what happens.

-Charlie
8) Message boards : Number crunching : Is Rosetta still issuing COVID-19 related workunits? (Message 92726)
Posted 31 Mar 2020 by Charles Dennett
Post:

I just have one small issue. Is anyone getting absurdly low credits for those tasks? I've set the run time to 24 hours and am getting less than 300 credits per core per 24 hours on an AMD Ryzen 3600, which is about 10 credits per decoy generated. One of the latest COVID-19 tasks granted me a whopping 58 credits for 28 decoys generated...
http://boinc.bakerlab.org/rosetta/result.php?resultid=1136826641


Just noticed the same thing. Credit of 8 to 12 for jobs running four hours. Normally jobs on my systems get 100-400 credit for the typical jobs.

Charlie
9) Message boards : Number crunching : Linux Hung Machine (Message 92334)
Posted 26 Mar 2020 by Charles Dennett
Post:
Excuse the bogus signature. Back crunching after being away for a few years. Fixed it in my profile. Now to go fix it in my forum signature.

<edit>Fixed</edit>
10) Message boards : Number crunching : Linux Hung Machine (Message 92333)
Posted 26 Mar 2020 by Charles Dennett
Post:
I am having issues with Rosetta jobs on my Linux machine. The machine randomly becomes unresponsive and I have to power cycle it in order to get it back. The power cycle obviously clears the issue but also any evidence of what caused it. I only have this issue when running Rosetta jobs. Has anyone else had this issue? I would appreciate some help.

Thanks,
Greg


Possible memory/swap issues? Maybe the machine is starting to use a good amount of swap space? How much memory does the machine have?
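
One way to answer those questions on a Linux box is to read /proc/meminfo. The following is a minimal sketch, not a diagnostic tool; the function names are invented for illustration, and it relies only on the standard MemTotal, SwapTotal and SwapFree fields.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(info):
    """Swap currently in use, in kB."""
    return info["SwapTotal"] - info["SwapFree"]


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    print("MemTotal kB:", info.get("MemTotal"))
    if "SwapTotal" in info and "SwapFree" in info:
        print("Swap used kB:", swap_used_kb(info))
```

Running it periodically while Rosetta tasks are active would show whether swap use climbs before the machine locks up.
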
Charlie
11) Questions and Answers : Wish list : Resource share can not be set to 0 (Message 78789)
Posted 14 Sep 2015 by Charles Dennett
Post:
This past weekend I tried to do this - set resource share to zero. However, the BOINC software on the Rosetta servers is too old to allow this. Newer versions of the software do allow it. When the resource share is set to zero, a system will only download a Rosetta task if no tasks are available from the other projects the system is connected to, and then it will only download one task per processor core. That way one can use the project as a backup. The project I've been crunching for recently went down unexpectedly over the weekend and I thought I'd add Rosetta to my list of backup projects. Rosetta allowed me to set the resource share to 1E-99, but still wanted to download many tasks at a time.

So, are there any plans to upgrade the software on the Rosetta servers to something more recent? I'm sure there are a lot of other new features that might be worthwhile in the newer software.
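
To make the backup behaviour described above concrete, here is a toy model of it. This is only a sketch of the rule as stated in this post (fetch nothing while other projects have work, otherwise keep at most one task per core); it is not BOINC's actual scheduling code, and every name in it is hypothetical.

```python
def backup_tasks_to_request(other_projects_have_work, cores, tasks_on_hand):
    """Tasks a zero-resource-share (backup) project would be asked for.

    Models only the rule described in the post, not the real BOINC client.
    """
    if other_projects_have_work:
        return 0
    # Top up to at most one task per processor core.
    return max(cores - tasks_on_hand, 0)


print(backup_tasks_to_request(True, 4, 0))
print(backup_tasks_to_request(False, 4, 1))
```

With a share of 1E-99 instead of a true zero, this rule never applies, which matches the many-tasks-at-a-time behaviour seen here.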

- Charlie
12) Message boards : Number crunching : Cannot retrieve new work (Message 77302)
Posted 7 Aug 2014 by Charles Dennett
Post:
lol! over 1.1 million tasks in progress. I can not get any new task for my hosts.


2 million in progress now. I just hope most of them come back completed rather than hitting 10 day expiration.


Curious as to how you get those numbers. (of course, you're on the inside so may have access to better info than we mere mortals do :-)

From the home page in the upper right corner I see this:


Server Status as of 7 Aug 2014 16:07:41 UTC
[ Scheduler running ]
Total queued jobs: 378,664
In progress: 916,437

Then from the server status page, I see this:

As of 7 Aug 2014 17:31:52 UTC
State Approximate #results
Ready to send 15,330
In progress 691,956

Are the "Total queued jobs" and "Ready to send" numbers supposed to be the same? I realize these numbers are generated at different times, but I would expect them to be close.

Thanks for any insight.

Charlie

13) Message boards : Number crunching : Cannot retrieve new work (Message 77292)
Posted 6 Aug 2014 by Charles Dennett
Post:
I'm seeing a slightly different problem. From my boinc logs:

06-Aug-2014 04:30:57 [rosetta@home] Sending scheduler request: To fetch work.
06-Aug-2014 04:30:57 [rosetta@home] Requesting new tasks for CPU
06-Aug-2014 04:31:00 [rosetta@home] Scheduler request completed: got 0 new tasks
06-Aug-2014 04:31:00 [rosetta@home] No work sent


This is happening more often than not on all 4 of my systems. They try to get new work but get nothing. I also crunch for WCG and POEM (on my one 64-bit Linux box - all my systems run Linux), so I have work, just not Rosetta. Once in a while I'll get a Rosetta workunit, but not often.
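
To put a number on "more often than not", the client log can be scanned for scheduler replies. This is a rough sketch that assumes the message wording stays exactly as in the log excerpt above; the function name is made up.

```python
import re

# Pattern taken from the "Scheduler request completed" line quoted above.
REPLY = re.compile(r"Scheduler request completed: got (\d+) new task")


def empty_reply_fraction(log_lines):
    """Fraction of scheduler replies that delivered zero tasks, or None if no replies."""
    counts = []
    for line in log_lines:
        match = REPLY.search(line)
        if match:
            counts.append(int(match.group(1)))
    if not counts:
        return None
    return sum(1 for count in counts if count == 0) / len(counts)


sample_log = [
    "06-Aug-2014 04:30:57 [rosetta@home] Sending scheduler request: To fetch work.",
    "06-Aug-2014 04:30:57 [rosetta@home] Requesting new tasks for CPU",
    "06-Aug-2014 04:31:00 [rosetta@home] Scheduler request completed: got 0 new tasks",
    "06-Aug-2014 04:31:00 [rosetta@home] No work sent",
]
print(empty_reply_fraction(sample_log))
```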

Charlie
14) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 77222)
Posted 3 Aug 2014 by Charles Dennett
Post:
Thanks for the update. Since I crunch for other life science projects, I always have work. Hopefully things will settle out shortly.

I am curious about one thing. You said there was a large increase in users/computers. Do you mean regular crunchers like us or others who use the results we produce?
15) Message boards : Number crunching : run time is set for 4 hrs but tasks are running 6 hrs (Message 58681)
Posted 8 Jan 2009 by Charles Dennett
Post:
I want to start out by saying this is entirely based on my observations over the years of Boinc and how it behaves.


I've seen the situation where within one workunit the models take quite different times to complete. So, a WU starts running. A model completes. Boinc then says "Based on the time used so far, the number of models I've completed, and the time remaining, can I squeeze in another model?" If the answer is no, it decides the workunit is done, reports the results back and goes on to the next workunit. If the answer is yes, then it starts another model and then repeats the process.

However, let's say you have a 4-hour time limit. You've completed some number of models in the currently running workunit. The average time for completion for these models is 30 minutes each. You have a bit over 30 minutes left in your 4-hour time limit. So, Boinc figures it can complete another model. However, for some reason this model takes 2 hours to complete. What happens? The model runs to completion and the time for the workunit comes out to be close to 5 1/2 hours. That's way past your 4-hour run time, but Boinc had no way of knowing the last model would take so long.
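
The scenario above can be played out in a few lines. This is only a sketch of the decision rule as I've described it from observation (start another model if the time left is at least the average model time so far); the function name and the exact comparison are assumptions, not the real code.

```python
def simulate_workunit(model_minutes, limit_minutes):
    """Run models in order until the observed rule says another one will not fit.

    Returns (models completed, total minutes used).
    """
    elapsed = 0.0
    completed = 0
    for duration in model_minutes:
        if completed > 0:
            average = elapsed / completed
            if limit_minutes - elapsed < average:
                break  # another average-length model would not fit
        elapsed += duration
        completed += 1
    return completed, elapsed


# Seven 30-minute models, then one that unexpectedly takes 2 hours; 4-hour limit.
models, minutes = simulate_workunit([30] * 7 + [120], 240)
print(models, minutes / 60)
```

With those numbers the eighth model is started with 30 minutes left and the workunit ends at 5.5 hours, as in the example.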

Also, when this happens, Boinc recalculates the run time of the other workunits in your queue. Typically from what I've observed, if it needs to raise the estimate, it raises it to the run time of the workunit that just finished. If it wants to lower it, it lowers it a bit at a time. (Actually what it is doing is recalculating a number called the duration correction factor. It also does this after each workunit completes, but usually the newly calculated runtimes are close to their previous values because the actual runtime was close to the originally estimated runtime.)
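
The raise-fast, lower-slowly pattern can be sketched like this. It models only the behaviour I've observed; the 0.1 weight and the function name are illustrative assumptions and are not taken from the Boinc source.

```python
def update_dcf(dcf, estimated_seconds, actual_seconds, weight=0.1):
    """Return a new duration correction factor after one workunit finishes.

    estimated_seconds is the uncorrected estimate; weight is an assumed smoothing value.
    """
    ratio = actual_seconds / estimated_seconds
    if ratio > dcf:
        return ratio                      # raise straight to the observed ratio
    return dcf + weight * (ratio - dcf)   # lower a bit at a time


raised = update_dcf(1.0, 3600, 7200)   # workunit took twice the estimate
lowered = update_dcf(2.0, 3600, 3600)  # next workunit ran as estimated
print(raised, round(lowered, 2))
```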

Charlie
16) Message boards : Number crunching : Rosetta server(s) clocks wrong? (Message 57618)
Posted 5 Dec 2008 by Charles Dennett
Post:
I think one or more R@H server clocks may be off. I've noticed this the past few days. So, I just forced an update from one of my systems and watched the list for my computers on the R@H website for the time last contacted. After taking into account the time zone difference between UTC and my time zone, it looks like the clocks are off by about 15 minutes.

17) Message boards : Number crunching : Problems with web site (Message 57280)
Posted 27 Nov 2008 by Charles Dennett
Post:
Recently the text formatting of the forums does not work properly. The text runs from side to side in full screen. You have to center the text on your screen and push the user details off the screen in order to read the post.

Previously the text was auto-formatted so that it fit the window without resizing and without having to center the text via the scroll bar.


The text formatter will break up strings of words at logical places. Usually that's at the spaces between words, although I've seen it also break at hyphens. However, when someone enters a very long string of characters, like one of those super long workunit names that has no spaces or hyphens in it, it will cause the text to extend beyond the usual right margin and reset the right margin further out to accommodate this long string. Since there can only be one right margin on the page, everything now goes out to the new right margin.
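
The same effect can be reproduced with Python's textwrap module, used here purely as an analogy for the forum's formatter (which is not Python). With break_long_words=False, a word with no spaces or hyphens is left whole and overflows the requested width; the sample text is made up.

```python
import textwrap

long_name = "averyveryverylongworkunitnamewithoutspaces"
text = "short words then " + long_name + " end"

# Wrap at 20 columns, breaking only at spaces (and hyphens), never inside a word.
lines = textwrap.wrap(text, width=20, break_long_words=False)

for line in lines:
    print(len(line), line)
```

The unbreakable name lands on a line of its own that is wider than the 20-column limit, just as a long workunit name pushes out the margin on the forum page.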

18) Message boards : Number crunching : Rosetta Stalls in the last 5% (Message 56477)
Posted 25 Oct 2008 by Charles Dennett
Post:
Because of the varying times, I pretty much ignore the estimated time to completion and the % complete. Rosetta really can't tell how much work it has left to do on a model so those numbers are sometimes way off. Let it go and Boinc/Rosetta will do the right thing.
19) Message boards : Number crunching : Preemption Failures on Linux (Message 47580)
Posted 10 Oct 2007 by Charles Dennett
Post:

Does anyone have a machine or Linux version where they do not seem to encounter these problems?

Does the version of BOINC have any impact on seeing problems occur more or less frequently?

When a task "freezes", does it still use CPU? What status does BOINC show on the task?

Has a thorough memory test been run on the machine? What % of memory is BOINC allowed to use (see your General Preferences)? And how much memory is on the machine?

Is the machine overclocked?

Are some task names more likely to see a problem than others?



I've been running BOINC for quite a while on my home Linux box. It's running Fedora 5 with BOINC 5.10.8. It is not overclocked. The MB is an ASUS A7V8X with an AMD XP 2600+ processor. It has 1 GB of memory and the preferences are set to 50% when in use and 90% when idle. This machine also runs my website.

The darn thing is pretty much rock solid. I rarely have any aborts, freezes, or other problems with RAH or any other project. Right now I'm crunching SIMAP, but when that project does not have work, it crunches RAH. (SIMAP typically has 2-5 days' worth of work at the beginning of each month, but this month is unusual in that it has a ton of new work.) So, if you check my results (computers are visible, I believe) you might not see too many results listed.

The only time I've seen problems is when my cable connection or router has a problem. Then BOINC can't phone home and results (not just RAH) seem to abort at times.

Charlie
20) Message boards : Number crunching : Windows Vista Defender annoys... (Message 47380)
Posted 4 Oct 2007 by Charles Dennett
Post:
I installed it as a service on a Vista laptop and it works fine that way for me.





©2024 University of Washington
https://www.bakerlab.org