1)
Message boards :
Number crunching :
Large proteins
(Message 95060)
Posted 21 Apr 2020 by Charles Dennett Post:
Have the larger ones started coming out? This beast ran for 18 hours and only had one decoy: https://boinc.bakerlab.org/rosetta/result.php?resultid=1156094562
I believe both of those were ended by the watchdog after going 10 hours (36000 seconds) over your selected run times:
BOINC:: CPU time: 65043.4s, 36000s + 28800s[2020- 4-21 8:57: 3:] :: BOINC
Output exists: default.out.gz Size: WARNING! cannot get file size for default.out.gz: could not open file. -1
BOINC:: CPU time: 50670.5s, 36000s + 14400s[2020- 4-21 10:22:32:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1
-Charlie |
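The arithmetic in those watchdog lines can be checked with a small script. This is only a sketch: it assumes the two numbers after the CPU time are the 36000-second allowance and the selected run time, which is how the post reads them, and the function name `parse_watchdog_line` is made up for illustration.

```python
import re

# Matches the fragment quoted in the post, e.g.
# "BOINC:: CPU time: 65043.4s, 36000s + 28800s"
# Assumption: the two trailing numbers are the allowance and the selected run time.
WATCHDOG_LINE = re.compile(r"CPU time:\s*([\d.]+)s,\s*(\d+)s\s*\+\s*(\d+)s")


def parse_watchdog_line(line):
    """Return (cpu_seconds, limit_seconds, overrun_seconds), or None if no match."""
    match = WATCHDOG_LINE.search(line)
    if match is None:
        return None
    cpu = float(match.group(1))
    limit = int(match.group(2)) + int(match.group(3))
    return cpu, limit, cpu - limit


for line in [
    "BOINC:: CPU time: 65043.4s, 36000s + 28800s",
    "BOINC:: CPU time: 50670.5s, 36000s + 14400s",
]:
    cpu, limit, overrun = parse_watchdog_line(line)
    print(f"cpu={cpu:.1f}s limit={limit}s overrun={overrun:.1f}s")
```

In both quoted lines the CPU time ends up a few minutes past the summed limit, which fits the idea that the watchdog stepped in shortly after the limit was crossed.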
2)
Message boards :
Number crunching :
Large proteins
(Message 94898)
Posted 19 Apr 2020 by Charles Dennett Post: Will there be a way to identify these long running tasks based on their name? |
3)
Message boards :
Number crunching :
Discussion on increasing the default run time
(Message 94692)
Posted 17 Apr 2020 by Charles Dennett Post: There is apparently an issue with some (not all) tasks whose name starts with 12v1n. I had one go over 36 hours before I finally aborted it. Others have reported similar issues. However, not all tasks that start with 12v1n have the issue. Just abort it if it runs past where the watchdog would abort it. -Charlie |
4)
Message boards :
Number crunching :
Rosetta 4.1+ and 4.2+
(Message 94691)
Posted 17 Apr 2020 by Charles Dennett Post: Yeah, I had a similar one whose name started the same way as yours. It ran over a day and a half before I aborted it. Apparently others have reported issues with tasks with names like that. Just abort it. -Charlie |
5)
Message boards :
Number crunching :
Watchdog not working too well
(Message 94669)
Posted 17 Apr 2020 by Charles Dennett Post: After over a day and a half of cpu time I've aborted the task. -Charlie |
6)
Message boards :
Number crunching :
Watchdog not working too well
(Message 94626)
Posted 16 Apr 2020 by Charles Dennett Post: I'm going to let it run for a while just to see what happens. I don't use a specific venue. A couple of weeks ago I raised the run time from 1 hour to 8 hours and ran that way for a while. I noticed my RAC started dropping so several days ago I lowered it to 2 hours to see if by any chance it would make a difference (not that I expect it to). The task was received well after I did that and was preceded by a lot of tasks that ran successfully with the 2 hour cpu time. So, I doubt that would have been the reason. Still, with an 8 hour run time the watchdog would have aborted it after 12 hours. Unfortunately, the dog in my profile is no longer with us. I'll have to get a picture of my new yellow lab. I ran R@H for a long time but stopped several years ago with all distributed computing. I retired a year ago and with the recent pandemic I jumped back in to do my part. R@H was always one of my favorites. Right now it's all I'm doing across 3 systems/12 cores. -Charlie |
7)
Message boards :
Number crunching :
Watchdog not working too well
(Message 94622)
Posted 16 Apr 2020 by Charles Dennett Post: Have a task that is on a 2 hour run time target. The watchdog should have stopped it at 6 CPU hours. Currently it is over 21 hours of cpu time:
Application: Rosetta 4.15
Name: 12v1n_al_12mer_design_00240_010210_0001_SAVE_ALL_OUT_914331_72
State: Running
Received: Wed 15 Apr 2020 12:05:07 PM EDT
Report deadline: Sat 18 Apr 2020 12:05:06 PM EDT
Estimated computation size: 80,000 GFLOPs
CPU time: 21:22:39
CPU time since checkpoint: 21:22:39
Elapsed time: 21:39:20
Estimated time remaining: 00:10:07
Fraction done: 99.226%
Virtual memory size: 382.00 MB
Working set size: 304.89 MB
Directory: slots/3
Process ID: 164535
Progress rate: 4.680% per hour
Executable: rosetta_4.15_x86_64-pc-linux-gnu
Also note that it has not checkpointed yet either. Looking at files in the slots/3 directory does show some current activity (current time at my location is 13:34 as I type this):
ls -lart | tail
-rw-r--r--. 1 boinc boinc 0 Apr 15 15:50 rosetta_tmp.txt
-rw-r--r--. 1 boinc boinc 0 Apr 15 15:50 minirosetta_database.zip.is_extracted
-rw-rw-r--. 1 charlie charlie 0 Apr 16 06:57 stderrgfx.txt
-rw-rw-r--. 1 charlie charlie 14 Apr 16 06:57 gfx_info
-rw-r--r--. 1 boinc boinc 6175 Apr 16 11:28 init_data.xml
drwxrwx--x. 3 boinc boinc 20480 Apr 16 11:28 .
-rw-r--r--. 1 boinc boinc 9529 Apr 16 13:30 12v1n_al_12mer_design_00240_010210_0001_check.txt
-rw-r--r--. 1 boinc boinc 3589 Apr 16 13:30 rng.state.gz
-rw-rw----. 1 boinc boinc 25001680 Apr 16 13:33 boinc_rosetta_3
-rw-r--r--. 1 boinc boinc 8192 Apr 16 13:33 boinc_mmap_file
A tail of the 12v1n_al_12mer_design_00240_010210_0001_check.txt file shows this:
tail 12v1n_al_12mer_design_00240_010210_0001_check.txt
LAST 497 SUCCESS 0
LAST 498 SUCCESS 0
LAST 499 SUCCESS 0
LAST 500 SUCCESS 0
LAST 501 SUCCESS 0
LAST 502 SUCCESS 0
LAST 503 SUCCESS 0
LAST 504 SUCCESS 0
LAST 505 SUCCESS 0
LAST 506 SUCCESS 0
Here's a link to the task: https://boinc.bakerlab.org/rosetta/result.php?resultid=1150908452
I'm going to let it run for a while just to see what happens. -Charlie |
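To put a number on how far past the expected cutoff this task is, here is a rough sketch. The 4-hour grace period is an assumption inferred from the post's own figures (2-hour target, cutoff expected at 6 CPU hours), not a documented watchdog setting, and both helper names are hypothetical.

```python
# Assumption inferred from the post: a 2 h target gives a 6 h cutoff, so grace = 4 h.
GRACE_HOURS = 4


def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def watchdog_overrun_seconds(cpu_time_hms, target_hours, grace_hours=GRACE_HOURS):
    """Seconds of CPU time beyond target + grace (negative if still within it)."""
    cutoff = (target_hours + grace_hours) * 3600
    return hms_to_seconds(cpu_time_hms) - cutoff


# CPU time and run time target taken from the task properties above.
overrun = watchdog_overrun_seconds("21:22:39", target_hours=2)
print(f"{overrun / 3600:.2f} hours past the expected cutoff")
```

A positive result means the task has used more CPU time than the target plus the assumed grace period, which is the situation described in the post.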
8)
Message boards :
Number crunching :
Is Rosetta still issuing COVID-19 related workunits?
(Message 92726)
Posted 31 Mar 2020 by Charles Dennett Post:
Just noticed the same thing. Credit of 8 to 12 for jobs running four hours. Normally jobs on my systems get 100-400 credit for the typical jobs. Charlie |
9)
Message boards :
Number crunching :
Linux Hung Machine
(Message 92334)
Posted 26 Mar 2020 by Charles Dennett Post: Excuse the bogus signature. Back crunching after being away for a few years. Fixed it in my profile. Now to go fix it in my forum signature. <edit>Fixed</edit> |
10)
Message boards :
Number crunching :
Linux Hung Machine
(Message 92333)
Posted 26 Mar 2020 by Charles Dennett Post: I am having issues with Rosetta jobs on my Linux machine. The machine randomly becomes unresponsive and I have to power cycle in order to get it back. The power cycle obviously purges the issue but also any evidence of what caused it. I only have this issue when running Rosetta jobs. Has anyone else had this issue? I would appreciate some help. Possible memory/swap issues? Maybe the machine is starting to use a good amount of swap space? How much memory does the machine have? Charlie |
11)
Questions and Answers :
Wish list :
Resource share can not be set to 0
(Message 78789)
Posted 14 Sep 2015 by Charles Dennett Post: This past weekend I tried to do this - set resource share to zero. However, the boinc software on the Rosetta servers is too old to allow this. Newer versions of the software allow this. When the resource share is set to zero, a system will only download a rosetta task if no other tasks from other projects the system is connected to are available, and then only download one task per processor core. That way one can use the project as a backup. The project I've been crunching for recently went down unexpectedly over the weekend and I thought I'd add Rosetta to my list of backup projects. Rosetta allowed me to set the resource share to 1E-99, but still wanted to download many tasks at a time. So, are there any plans to upgrade the software on the rosetta servers to something more recent? I'm sure there are a lot of other new features that might be worthwhile in the newer software. - Charlie |
12)
Message boards :
Number crunching :
Cannot retrieve new work
(Message 77302)
Posted 7 Aug 2014 by Charles Dennett Post: lol! over 1.1 million tasks in progress. I can not get any new task for my hosts. Curious as to how you get those numbers. (of course, you're on the inside so may have access to better info than we mere mortals do :-) From the home page in the upper right corner I see this: Server Status as of 7 Aug 2014 16:07:41 UTC [ Scheduler running ] Total queued jobs: 378,664 In progress: 916,437 Then from the server status page, I see this: As of 7 Aug 2014 17:31:52 UTC State Approximate #results Ready to send 15,330 In progress 691,956 Are the Total Queued jobs and ready to send supposed to be the same? I realize the times these numbers are generated are different, but I would expect them to be close. Thanks for any insight. Charlie |
13)
Message boards :
Number crunching :
Cannot retrieve new work
(Message 77292)
Posted 6 Aug 2014 by Charles Dennett Post: I'm seeing a slightly different problem. From my boinc logs:
06-Aug-2014 04:30:57 [rosetta@home] Sending scheduler request: To fetch work.
06-Aug-2014 04:30:57 [rosetta@home] Requesting new tasks for CPU
06-Aug-2014 04:31:00 [rosetta@home] Scheduler request completed: got 0 new tasks
06-Aug-2014 04:31:00 [rosetta@home] No work sent
This is happening more often than not on all 4 of my systems. They try to get new work but get nothing. I also crunch for WCG and POEM (on my one 64 bit Linux box - all my systems run Linux) so I have work, just not rosetta. Once in a while I'll get a rosetta workunit but not often. Charlie |
14)
Message boards :
Number crunching :
Problems and Technical Issues with Rosetta@home
(Message 77222)
Posted 3 Aug 2014 by Charles Dennett Post: Thanks for the update. Since I crunch for other life science projects, I always have work. Hopefully things will settle out shortly. I am curious about one thing. You said there was a large increase in users/computers. Do you mean regular crunchers like us or others who use the results we produce? |
15)
Message boards :
Number crunching :
run time is set for 4 hrs but tasks are running 6 hrs
(Message 58681)
Posted 8 Jan 2009 by Charles Dennett Post: I want to start out by saying this is entirely based on my observations over the years of Boinc and how it behaves. I've seen the situation where within one workunit the models take quite different times to complete. So, a WU starts running. A model completes. Boinc then says "Based on the time used so far and the number of models I've completed and the time remaining, can I squeeze in another model?" If the answer is no, it decides the workunit is done, reports the results back and goes on to the next workunit. If the answer is yes, then it starts another model and then repeats the process. However, let's say you have a 4 hour time limit. You've completed some number of models in the currently running workunit. The average time for completion for these models is 30 minutes each. You have a bit over 30 minutes left in your 4 hour time limit. So, Boinc figures it can complete another model. However, for some reason this model takes 2 hours to complete. What happens? The model runs to completion and the time for the workunit comes out to be close to 5 1/2 hours. That's way past your 4 hour run time but Boinc had no way of knowing the last model would take so long. Also, when this happens, Boinc recalculates the run time of the other workunits in your queue. Typically from what I've observed, if it needs to raise the estimate, it raises it to the run time of the workunit that just finished. If it wants to lower it, it lowers it a bit at a time. (Actually what it is doing is recalculating a number called the duration correction factor. It also does this after each workunit completes, but usually the newly calculated runtimes are close to their previous values because the actual runtime was close to the originally estimated runtime.) Charlie |
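The "can I squeeze in another model?" reasoning can be played out in a toy simulation. This is a sketch of the behaviour as described in the post, not the actual BOINC or Rosetta logic: the decision rule (compare the average model time so far with the time remaining) and the function name `run_workunit` are assumptions made for illustration.

```python
def run_workunit(model_times_min, target_min):
    """Simulate one workunit; times are in minutes.

    Assumed rule: after each model, start another only if an average-length
    model would still fit inside the target run time.
    """
    elapsed = 0.0
    completed = 0
    for model_time in model_times_min:
        if completed > 0:
            average = elapsed / completed
            if elapsed + average > target_min:
                break  # no room left for another average-length model
        elapsed += model_time  # a model that has started always runs to completion
        completed += 1
    return completed, elapsed


# Seven 30-minute models, then an unexpected 2-hour model, with a 4-hour target.
completed, elapsed = run_workunit([30] * 7 + [120], target_min=240)
print(f"{completed} models in {elapsed / 60:.1f} hours")
```

With these inputs the eighth model is started because the average so far says it should fit, and the workunit finishes at 5.5 hours, matching the "close to 5 1/2 hours" example in the post.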
16)
Message boards :
Number crunching :
Rosetta server(s) clocks wrong?
(Message 57618)
Posted 5 Dec 2008 by Charles Dennett Post: I think one or more R@H server clocks may be off. I've noticed this the past few days. So, I just forced an update from one of my systems and watched the list for my computers on the R@H website for the time last contacted. After taking into account the time zone difference between UTC and my time zone, it looks like the clocks are off by about 15 minutes. |
17)
Message boards :
Number crunching :
Problems with web site
(Message 57280)
Posted 27 Nov 2008 by Charles Dennett Post: Recently the text formatting of the forums does not work properly. The text is going from side to side on full screen. You have to center the text in your screen and set the user details off the screen in order to read the post. The text formatter will break up strings of words at logical places. Usually that's at the spaces between words, although I've seen it also break at hyphens. However, when someone enters a very long string of characters, like one of those super long workunit names that has no spaces or hyphens in it, it will cause the text to extend beyond the usual right margin and reset the right margin out further to accommodate this long string. Since there can only be one right margin on the page, everything now goes out to the new right margin. |
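The same effect can be reproduced with Python's standard `textwrap` module, used here only as a stand-in for the forum's formatter (which is not Python, as far as I know). The workunit-style name below is a made-up example, and `wrap_and_measure` is a hypothetical helper.

```python
import textwrap


def wrap_and_measure(text, width):
    """Wrap at spaces and hyphens only; return the lines and the widest line length."""
    lines = textwrap.wrap(
        text,
        width=width,
        break_long_words=False,  # never split a long unbroken string
        break_on_hyphens=True,   # hyphens are allowed break points
    )
    return lines, max(len(line) for line in lines)


normal = "The formatter breaks lines at spaces and at hyphens in well-behaved text"
# Hypothetical long name with no spaces or hyphens, in the style of a workunit name.
long_name = "a_very_long_workunit_name_without_spaces_or_hyphens_0001"

_, normal_width = wrap_and_measure(normal, 30)
_, stretched_width = wrap_and_measure(normal + " " + long_name, 30)
print(normal_width, stretched_width)
```

The ordinary text stays within the 30-character margin, while the text containing the unbroken name has one line as wide as the name itself, which is the widened right margin described in the post.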
18)
Message boards :
Number crunching :
Rosetta Stalls in the last 5%
(Message 56477)
Posted 25 Oct 2008 by Charles Dennett Post: Because of the varying times, I pretty much ignore the estimated time to completion and the % complete. Rosetta really can't tell how much work it has left to do on a model so those numbers are sometimes way off. Let it go and Boinc/Rosetta will do the right thing. |
19)
Message boards :
Number crunching :
Preemption Failures on Linux
(Message 47580)
Posted 10 Oct 2007 by Charles Dennett Post:
I've been running boinc for quite a while on my home Linux box. It's running fedora 5 with BOINC 5.10.8. It is not overclocked. The MB is an ASUS A7V8X with an AMD XP 2600+ processor. It has 1 GB of memory and preferences are for 50% when in use and 90% when idle. This machine also runs my website. The darn thing is pretty much rock solid. I rarely have any aborts, freezes, or other problems with RAH or any other project. Right now I'm crunching SIMAP, but when that project does not have work, it crunches RAH. (SIMAP typically has 2-5 days worth of work at the beginning of each month, but this month is unusual in that there was a ton of new work.) So, if you check my results (computers are visible, I believe) you might not see too many results listed. The only time I've seen problems is when my cable connection or router has a problem. Then BOINC can't phone home and results (not just RAH) seem to abort at times. Charlie |
20)
Message boards :
Number crunching :
Windows Vista Defender annoys...
(Message 47380)
Posted 4 Oct 2007 by Charles Dennett Post: I installed it as a service on a Vista laptop and it works fine that way for me. |
©2024 University of Washington
https://www.bakerlab.org