Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Tomcat雄猫 Send message Joined: 20 Dec 14 Posts: 180 Credit: 5,386,173 RAC: 0 |
I've only seen that type of task once; I aborted it after more than 20 hours. CPU time was ridiculously low, a few minutes at most, IIRC. Far more often, I see tasks that claim to be running but have the timer stuck at 0:00 and produce no results for days. Those tasks get unstuck after relaunching BOINC. I haven't had any issues in a while, knock on wood. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 2,014 |
Quote: "I've been out of the loop for a little while. Did they recently fix the RAM requirements for the vBox tasks? I'm running 7 Rosetta Python tasks + 1 WCG ARP on 16GBs of RAM." Appears to be PARTIALLY fixed, at least under Windows 10. The same amount of free memory is required to START a task, but the amount of memory the task reserves after it starts is usually much less than before. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 9,863 |
Quote: "I've been out of the loop for a little while. Did they recently fix the RAM requirements for the vBox tasks? I'm running 7 Rosetta Python tasks + 1 WCG ARP on 16GBs of RAM." They seem to have done, except two of my machines refuse to run more than 2 (quad core, 8GB). They don't indicate a shortage of RAM; the tasks just sit there waiting. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Do you have any of the "Vm job unmanageable" ones on your machine? That will prevent any more from downloading. That is because you are on Windows. BOINC has a pre-made VBox wrapper for that, which is probably what the pythons use: https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables That avoids the COM interface, which causes the problem. However, they don't have a pre-made Linux wrapper that avoids the problem. It also helps to run VirtualBox 5.x.x rather than 6.x.x, which also avoids the COM interface. Since you are on Win7, that is probably what you are using. |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 7,494 |
How do you use vboxwrapper on Windows? I can't find any good instructions on how to implement it. D |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Quote: "How do you use vboxwrapper on Windows? I can't find any good instructions on how to implement it." Beats me. The project uses it when they compile their stuff, as far as I know. I think LHC did their own wrapper and fixed the problem for Linux; I run CMS on it without the problem. PS - I tried substituting the wrapper from LHC (vboxwrapper_26196_x86_64-pc-linux-gnu) for the wrapper here. But BOINC does a checksum and rejects it. It uses only the python version here on Rosetta. PPS - I followed the instructions given on Cosmology, which also had the problem to some extent (but less than here, it seems). http://www.cosmologyathome.org/forum_thread.php?id=7769#22921 Maybe it works differently there, or on a different version of BOINC, but not here and now. |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 7,494 |
There's a version for download here: https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Quote: "There's a version for download here:" Yes, but the Linux versions state: "The following uses the COM interface; not recommended." The version they give for Linux (vboxwrapper_26198_x86_64-pc-linux-gnu) is the same version as used here. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
WTH? I got flagged for VM errors? So is 5.x generating VM errors or what? This is getting stupid. I need 5.x for Quchem, or at least that is the theory, but I need 6 for here? Switching to 6 and abandoning Quchem until I get more RAM, because another person who also does this project and Quchem can run it in 6, and apparently 5 kicks up errors here. And I get all kinds of errors that seem related to memory in Quchem. Craziness! Yeah, I know... pushing the machine too far for now. But 2 memory sticks are from one of my setups that I upgraded, and I offered the old MOBO and CPU to another person here in Europe. So I guess 24 GB is not enough memory for what I want to do in projects. I wish this project would give you an automated heads-up if you kick up too many VM errors, but then that is too advanced for this project. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Quote: "I need 5.x for Quchem or at least that is the theory, but I need 6 for here?" VBox 5.2.44 is working fine for me with Win10. But I have 48 GB memory, and am running only 7 work units on a Ryzen 3600. https://boinc.bakerlab.org/rosetta/results.php?hostid=6146985&offset=0&show_names=0&state=4&appid= |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Quote: "I need 5.x for Quchem or at least that is the theory, but I need 6 for here?" Yeah, I am going up towards your limit in RAM after New Year's. I've got 24, but I am going to drop the 2 x 4 and go with 2 x 16 along with the 2 x 8 that I already have. But remember, I run a lot of different stuff all at the same time. Since I got erased from Python, it seems I have to catch up again, so it's running 8 Python, then 4 WCG MCM and 2 SiDock, plus Einstein, PrimeGrid and FAH. That sucks up 77% of my total memory. Einstein, PrimeGrid and FAH are on GPU. Tuillo is running 6 both here and on Quchem and having no problem, but he isn't maxing out his machine. |
den777 Send message Joined: 29 Apr 13 Posts: 1 Credit: 1,545,047 RAC: 0 |
Recently I had to abort tasks that were not using CPU and showed no progress for over a day. The virtual machine console looks like this: (screenshot not preserved). So, you are pushing tasks with obvious errors without even minimal checking whether they can start at all? |
gbayler Send message Joined: 10 Apr 20 Posts: 14 Credit: 3,069,484 RAC: 0 |
I have 3 WUs/tasks running longer than any other tasks I have seen before; they don't seem to terminate. Their progress asymptotically approaches 100%, but, as it seems, never reaches it. These are the WUs in question: (1) https://boinc.bakerlab.org/rosetta/result.php?resultid=1462247667 (progress: 99.986%, elapsed: 2d 23:19:00, CPU time: 00:19:44); (2) https://boinc.bakerlab.org/rosetta/result.php?resultid=1462512698 (progress: 99.929%, elapsed: 2d 10:03:00, CPU time: 00:15:56); (3) https://boinc.bakerlab.org/rosetta/result.php?resultid=1462518266 (progress: 99.822%, elapsed: 2d 02:42:00, CPU time: 00:13:54). Do I have to manually abort such WUs? Best regards, Günther |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 7,494 |
Same here - I just found 5 tasks that are all at 99.999% after 3-4 days each. They are aaai, aaad, and abai tasks. I've tried suspending them and then letting them run again, but that doesn't help, so I'm going to abort them now. Anyone have any idea why this happens? It happens on some machines much more than others; this one, a dual Sandy Bridge Xeon, is my worst offender: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3632346 |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 7,494 |
Actually, it looks like the problem might be disk access. I've just had a look at Task Manager on that machine, which is showing that the SSD (120GB Kingston A400) is at 100%. It's only using 6.2GB of 16GB RAM, so I'd be surprised if it's smashing the page file. Stopping BOINC drops disk access to ~0%, and stopping other BOINC projects helped briefly, but drive usage is back at 100%. After I aborted a batch of failed VBox tasks, a load of new tasks started up. I presume that start-up requires a lot of disk activity and they're all fighting for it at the same time. EDIT: The disk was full. Windows finally popped up a notice to tell me. I've ordered a new SSD to put BOINC on. The problem is the huge size of these VBox tasks. If one VBox could run multiple threads/tasks, then that might save a lot of disk space, assuming they're working from the same dataset. |
gbayler Send message Joined: 10 Apr 20 Posts: 14 Credit: 3,069,484 RAC: 0 |
@dcdc: Thank you for your answer! In my case, there are ~14 GB free on the disk. That's too little to get additional tasks; I can see entries like this in the syslog: Dec 30 14:57:40 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:40 [Rosetta@home] Sending scheduler request: To fetch work. Dec 30 14:57:40 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:40 [Rosetta@home] Requesting new tasks for CPU Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] Scheduler request completed: got 0 new tasks Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] No tasks sent Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] rosetta python projects needs 5292.79MB more disk space. You currently have 13780.69 MB available and it needs 19073.49 MB. Not sure whether this interferes with the running tasks. In addition to the 3 problematic tasks, there are 2 other tasks (also VBox tasks) on this machine that seem to run normally. I'm using Ubuntu 21.10 on an i5-8400, if that makes a difference. The system has now created another task for the workunit that wasn't finished in time. I'm curious whether the next computer processing this WU will experience the same problems! |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
The jobs that run forever and use very little CPU power ("0 CPU") occur only on Linux, as far as I have found. They have been around for half the age of the universe, not that anyone at Rosetta is around to care. As I mention somewhere, they are easy to spot using BoincTasks. I just abort them. But they do not seem to be a problem on Windows. https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103883#103883 https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103823#103823 https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103689#103689 https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103659#103659 https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103493#103493 |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 9,863 |
Quote: "Actually, it looks like the problem might be disk access. I've just had a look at Task Manager on that machine, which is showing that the SSD (120GB Kingston A400) is at 100%. It's only using 6.2GB of 16GB RAM, so I'd be surprised if it's smashing the page file. Stopping BOINC drops disk access to ~0%, and stopping other BOINC projects helped briefly but drive usage is back at 100%." I run LHC and this on a 24-core machine. When this project started using VBox as well, I had to move BOINC to the rotary drive. I can't afford an SSD that big. |
Charles Tomaras Send message Joined: 18 Aug 09 Posts: 11 Credit: 26,097,452 RAC: 25,779 |
I haven't gotten any work units in at least a week now. I've tried resetting the project. I've now got other stuff running instead of Rosetta. I see no news that it's been down. Anything else I can do to figure out why I'm not receiving work units? |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 7,494 |
Is anyone getting any work? I'm not picking up any python tasks at the moment. I see I'm not the only one! I've been getting work most of the week until now, but the server status shows there should be work available. |
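
Two of the symptoms reported in this thread can be checked with a few lines of script rather than by eye: tasks whose elapsed time runs to days while CPU time stays at minutes, and the scheduler message saying how much disk space is missing. The following is only a minimal Python sketch under stated assumptions. The function names, the 5% CPU-to-elapsed ratio, and the one-day minimum are hypothetical choices made for illustration, not anything BOINC or Rosetta@home defines, and the input strings are simply copied from the posts above.

```python
import re

# Hypothetical thresholds, chosen for illustration only (not BOINC settings).
STALL_RATIO = 0.05          # flag a task if CPU time < 5% of elapsed time
MIN_ELAPSED_S = 24 * 3600   # only judge tasks that have run at least one day


def parse_duration(text):
    """Convert 'HH:MM:SS' or 'Nd HH:MM:SS' (as quoted in the thread) to seconds."""
    match = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def looks_stalled(elapsed, cpu_time):
    """Return True when a long-running task has used almost no CPU time."""
    elapsed_s = parse_duration(elapsed)
    cpu_s = parse_duration(cpu_time)
    return elapsed_s >= MIN_ELAPSED_S and cpu_s < STALL_RATIO * elapsed_s


def disk_shortfall_mb(message):
    """Return needed minus available disk space (MB) from the scheduler message.

    Returns None if the message does not have the wording seen in the thread.
    """
    match = re.search(r"have ([\d.]+) MB available and it needs ([\d.]+) MB", message)
    if match is None:
        return None
    available, needed = map(float, match.groups())
    return needed - available


if __name__ == "__main__":
    # Elapsed and CPU times of the three tasks reported in the thread.
    reported = [
        ("2d 23:19:00", "00:19:44"),
        ("2d 10:03:00", "00:15:56"),
        ("2d 02:42:00", "00:13:54"),
    ]
    for elapsed, cpu in reported:
        print(elapsed, cpu, "stalled" if looks_stalled(elapsed, cpu) else "ok")

    log_line = ("rosetta python projects needs 5292.79MB more disk space. "
                "You currently have 13780.69 MB available and it needs 19073.49 MB.")
    print(f"short by about {disk_shortfall_mb(log_line):.2f} MB")
```

The ratio test is deliberately crude: a healthy task on a busy machine can also show low CPU time for a while, so the one-day minimum is there to avoid flagging tasks that have only just started. Anything flagged still needs a human decision on whether to abort it.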
©2024 University of Washington
https://www.bakerlab.org