Problems and Technical Issues with Rosetta@home

Tomcat雄猫
Joined: 20 Dec 14
Posts: 180
Credit: 5,386,173
RAC: 0
Message 103924 - Posted: 27 Dec 2021, 15:18:06 UTC - in response to Message 103920.  
Last modified: 27 Dec 2021, 15:19:37 UTC


The problem is that the ones I abort are the 'one-minute wonders' that have only a few seconds of CPU time after several hours of elapsed time, and it seems pointless to continue running them.


I've only seen that type of task once; I aborted it after over 20 hours. CPU time was ridiculously low, a few minutes at most, IIRC. Far more often, I see tasks that claim to be running, but with the timer stuck at 0:00 and no results in days. Those tasks get unstuck after relaunching BOINC.

I haven't had any issues in a while, knock on wood.
ID: 103924
robertmiles
Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 103927 - Posted: 27 Dec 2021, 19:17:51 UTC - in response to Message 103923.  

I've been out of the loop for a little while. Did they recently fix the RAM requirements for the vBox tasks? I'm running 7 Rosetta Python tasks + 1 WCG ARP on 16 GB of RAM.

I'm not having issues, for once.

Appears to be PARTIALLY fixed, at least under Windows 10. Same amount of free memory required to START a task, but the amount of memory the task reserves after it starts is usually much less than before.
ID: 103927
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 103928 - Posted: 27 Dec 2021, 19:18:44 UTC - in response to Message 103923.  

I've been out of the loop for a little while. Did they recently fix the RAM requirements for the vBox tasks? I'm running 7 Rosetta Python tasks + 1 WCG ARP on 16 GB of RAM.

I'm not having issues, for once.


They seem to have done, except two of my machines refuse to run more than 2 (quad core, 8 GB). They don't indicate a shortage of RAM; the tasks just sit there waiting.
ID: 103928
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 103933 - Posted: 28 Dec 2021, 17:54:15 UTC - in response to Message 103885.  

Do you have any of the "Vm job unmanageable" ones on your machine? That will prevent any more from downloading.
You need to reboot to fix it. Or find another project.

I luckily have never had any 'unmanageable' jobs or things like that,

That is because you are on Windows. BOINC has a pre-made VBox wrapper for that, which is probably what the pythons use:
https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables
That avoids the COM interface, which causes the problem.

However, they don't have a pre-made Linux wrapper that avoids the problem.

It also helps to run VirtualBox 5.x.x rather than 6.x.x, which also avoids the COM interface.
Since you are on Win7, that is probably what you are using.
ID: 103933
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,860,059
RAC: 3,073
Message 103934 - Posted: 28 Dec 2021, 18:16:20 UTC

How do you use vboxwrapper on Windows? I can't find any good instructions on how to implement it.

D
ID: 103934
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 103935 - Posted: 28 Dec 2021, 18:33:03 UTC - in response to Message 103934.  
Last modified: 28 Dec 2021, 18:52:17 UTC

How do you use vboxwrapper on Windows? I can't find any good instructions on how to implement it.

Beats me. As far as I know, the project uses it when they compile their stuff.
I think LHC did their own wrapper, and fixed the problem for Linux; I run CMS on it without the problem.

PS - I tried substituting the wrapper from LHC (vboxwrapper_26196_x86_64-pc-linux-gnu) for the wrapper here.
But BOINC does a checksum and rejects it. It uses only the python version here on Rosetta.
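The rejection described above is what one would expect if the client compares each downloaded file against a checksum supplied by the project. A minimal sketch of that idea follows. It is not BOINC's code: the use of MD5 is an assumption for illustration, and `file_md5` and `matches_expected` are hypothetical helper names.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large VM images do not need to fit in RAM.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """True if the file's digest equals the digest the project published."""
    return file_md5(path) == expected_hex.lower()
```

A swapped-in wrapper binary has different bytes, so its digest differs from the published one and a check of this kind fails, whatever the file is named.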

PPS - I followed the instructions given on Cosmology, which also had the problem to some extent (but less than here it seems).
http://www.cosmologyathome.org/forum_thread.php?id=7769#22921
Maybe it works differently there, or on a different version of BOINC, but not here and now.
ID: 103935
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,860,059
RAC: 3,073
Message 103936 - Posted: 28 Dec 2021, 18:49:25 UTC

ID: 103936
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 103937 - Posted: 28 Dec 2021, 18:54:43 UTC - in response to Message 103936.  

There's a version for download here:

https://boinc.berkeley.edu/trac/wiki/VboxApps#Premadevboxwrapperexecutables

Yes, but the Linux versions state:
The following uses the COM interface; not recommended.

The version they give for Linux (vboxwrapper_26198_x86_64-pc-linux-gnu) is the same version as used here.
ID: 103937
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 103940 - Posted: 29 Dec 2021, 18:55:46 UTC
Last modified: 29 Dec 2021, 19:02:19 UTC

WTH? I got flagged for VM errors?
So is 5.x generating VM errors or what?
This is getting stupid.
I need 5.x for Quchem, or at least that is the theory, but I need 6 for here?

I'm switching to 6 and abandoning Quchem until I get more RAM, because another person who also runs this project and Quchem can run it on 6, and apparently 5 kicks up errors here. I also get all kinds of errors in Quchem that seem related to memory.
Craziness!

Yeah, I know, I'm pushing the machine too far for now. But two of the memory sticks are from one of my setups that I upgraded, after which I offered the old motherboard and CPU to another person here in Europe. So I guess 24 GB is not enough memory for what I want to do across projects.

I wish this project would give you an automated heads-up if you kick up too many VM errors, but then that is too advanced for this project.
ID: 103940
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 103941 - Posted: 29 Dec 2021, 19:15:35 UTC - in response to Message 103940.  

I need 5.x for Quchem, or at least that is the theory, but I need 6 for here?

VBox 5.2.44 is working fine for me with Win10. But I have 48 GB memory, and am running only 7 work units on a Ryzen 3600.
https://boinc.bakerlab.org/rosetta/results.php?hostid=6146985&offset=0&show_names=0&state=4&appid=
ID: 103941
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 103942 - Posted: 29 Dec 2021, 23:17:39 UTC - in response to Message 103941.  

I need 5.x for Quchem, or at least that is the theory, but I need 6 for here?

VBox 5.2.44 is working fine for me with Win10. But I have 48 GB memory, and am running only 7 work units on a Ryzen 3600.
https://boinc.bakerlab.org/rosetta/results.php?hostid=6146985&offset=0&show_names=0&state=4&appid=



Yeah, I am going up towards your amount of RAM after New Year's.
I have 24 GB, but I am going to drop the 2 x 4 and go with 2 x 16 along with the 2 x 8 that I already have.
But remember I run a lot of different stuff all at the same time.

Since I got erased from Python, it seems I have to catch up again, so it's running 8 Python tasks, then 4 WCG MCM and 2 SiDock, plus Einstein, PrimeGrid and FAH. That uses up 77% of my total memory.
Einstein, PrimeGrid and FAH are on the GPU.

Tuillo is running 6 for both this project and Quchem and having no problems, but he isn't maxing out his machine.
ID: 103942
den777
Joined: 29 Apr 13
Posts: 1
Credit: 1,545,047
RAC: 0
Message 103945 - Posted: 30 Dec 2021, 10:13:48 UTC

Recently I had to abort tasks that were not using any CPU and had shown no progress for over a day.
Virtual machine console looks like this

So, you are pushing out tasks with obvious errors without even minimal checking of whether they can start at all?
ID: 103945
gbayler
Joined: 10 Apr 20
Posts: 14
Credit: 3,069,484
RAC: 0
Message 103946 - Posted: 30 Dec 2021, 11:50:48 UTC

I have 3 WUs/tasks that have been running longer than any other tasks I have seen before; they don't seem to terminate. Their progress asymptotically approaches 100% but, it seems, never reaches it.

These are the WUs in question:

https://boinc.bakerlab.org/rosetta/result.php?resultid=1462247667 progress: 99.986% elapsed: 2d 23:19:00 CPU time: 00:19:44
https://boinc.bakerlab.org/rosetta/result.php?resultid=1462512698 progress: 99.929% elapsed: 2d 10:03:00 CPU time: 00:15:56
https://boinc.bakerlab.org/rosetta/result.php?resultid=1462518266 progress: 99.822% elapsed: 2d 02:42:00 CPU time: 00:13:54
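
For reference, the ratio of CPU time to elapsed time for the three tasks above can be worked out with a short script. This is only a sketch: the 5% threshold and the names `to_seconds` and `looks_stalled` are assumptions for illustration, not anything BOINC or Rosetta defines.

```python
import re


def to_seconds(text):
    """Convert '2d 23:19:00' or '00:19:44' into a number of seconds."""
    match = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def looks_stalled(elapsed, cpu, threshold=0.05):
    """Flag a task whose CPU time is a tiny fraction of its elapsed time."""
    elapsed_s = to_seconds(elapsed)
    if elapsed_s == 0:
        return False  # nothing to judge yet
    return to_seconds(cpu) / elapsed_s < threshold


# (result ID, elapsed, CPU time) as listed above
tasks = [
    ("1462247667", "2d 23:19:00", "00:19:44"),
    ("1462512698", "2d 10:03:00", "00:15:56"),
    ("1462518266", "2d 02:42:00", "00:13:54"),
]

for result_id, elapsed, cpu in tasks:
    ratio = to_seconds(cpu) / to_seconds(elapsed)
    print(f"{result_id}: CPU/elapsed = {ratio:.2%}, stalled = {looks_stalled(elapsed, cpu)}")
```

All three tasks come out at under half a percent of CPU time per unit of elapsed time, which is why they are flagged.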

Do I have to manually abort such WUs?

Best regards,

Günther
ID: 103946
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,860,059
RAC: 3,073
Message 103947 - Posted: 30 Dec 2021, 12:34:10 UTC - in response to Message 103946.  

Same here - I just found 5 tasks that are all at 99.999% after 3-4 days each. They are aaai, aaad, and abai tasks. I've tried suspending them and then letting them run again but that doesn't help so I'm going to abort them now.

Anyone have any idea why this happens? It happens on some machines much more than others; this one, a dual Sandy Bridge Xeon, is my worst offender:

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3632346
ID: 103947
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,860,059
RAC: 3,073
Message 103948 - Posted: 30 Dec 2021, 12:55:02 UTC
Last modified: 30 Dec 2021, 13:07:19 UTC

Actually, it looks like the problem might be disk access. I've just had a look at Task Manager on that machine, which is showing that the SSD (120GB Kingston A400) is at 100%. It's only using 6.2GB of 16GB RAM, so I'd be surprised if it's smashing the page file. Stopping BOINC drops disk access to ~0%, and stopping other BOINC projects helped briefly but drive usage is back at 100%.

Having aborted a batch of failed VBox tasks, there were a load of new tasks starting up. I presume that start-up requires a lot of disk activity and they're all fighting for it at the same time.

EDIT: The disk was full. Windows finally popped a notice up to tell me. I've ordered a new SSD to put BOINC on. The problem is the huge size of these VBox tasks. If one VBox could run multiple threads/tasks then that might save a lot of disk space, assuming they're working from the same dataset.
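
A quick way to catch this before Windows does is to check free space on the drive that holds the BOINC data directory. The sketch below uses Python's standard `shutil.disk_usage`; the names `free_mb` and `has_room`, and the 1024 MB figure in the example call, are assumptions for illustration.

```python
import shutil


def free_mb(path):
    """Free space, in MB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / (1024 * 1024)


def has_room(path, required_mb):
    """True if the filesystem containing `path` has at least `required_mb` MB free."""
    return free_mb(path) >= required_mb


# "." stands in for the BOINC data directory, whose location varies by install.
print(f"{free_mb('.'):.2f} MB free; room for a 1024 MB task: {has_room('.', 1024)}")
```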
ID: 103948
gbayler
Joined: 10 Apr 20
Posts: 14
Credit: 3,069,484
RAC: 0
Message 103949 - Posted: 30 Dec 2021, 14:29:53 UTC

@dcdc: Thank you for your answer!

In my case, there are ~14 GB free on the disk. That's too little to get additional tasks; I can see entries like this in the syslog:
Dec 30 14:57:40 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:40 [Rosetta@home] Sending scheduler request: To fetch work.
Dec 30 14:57:40 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:40 [Rosetta@home] Requesting new tasks for CPU
Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] Scheduler request completed: got 0 new tasks
Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] No tasks sent
Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] rosetta python projects needs 5292.79MB more disk space.  You currently have 13780.69 MB available and it needs 19073.49 MB.
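
The figures in the last log line are consistent with each other: the shortfall is the required space minus the available space. A small sketch that pulls the three numbers out of such a line and checks the arithmetic follows; the regular expression is written against the one message quoted above and is an assumption about its general format.

```python
import re

# Pattern written against the single message quoted above; other BOINC
# versions may word it differently (assumption).
DISK_MSG = re.compile(
    r"needs (?P<shortfall>[\d.]+)\s*MB more disk space\.\s+"
    r"You currently have (?P<available>[\d.]+)\s*MB available "
    r"and it needs (?P<required>[\d.]+)\s*MB"
)


def parse_disk_message(line):
    """Return (shortfall, available, required) in MB, or None if no match."""
    match = DISK_MSG.search(line)
    if match is None:
        return None
    return tuple(float(match.group(k)) for k in ("shortfall", "available", "required"))


line = ("rosetta python projects needs 5292.79MB more disk space.  "
        "You currently have 13780.69 MB available and it needs 19073.49 MB.")
shortfall, available, required = parse_disk_message(line)
# required - available should equal the reported shortfall, up to rounding.
print(round(required - available, 2), shortfall)
```

Here 19073.49 - 13780.69 = 5292.80 MB, which matches the reported 5292.79 MB to within rounding.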

Not sure whether this interferes with the running tasks. In addition to the 3 problematic tasks there are 2 other tasks (also VBox tasks) on this machine that seem to run normally.

I'm using Ubuntu 21.10 on an i5-8400, if that makes a difference.

The system has now created another task for the workunit that wasn't finished in time. I'm curious whether the next computer processing this WU will experience the same problems!
ID: 103949
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 103950 - Posted: 30 Dec 2021, 14:57:03 UTC
Last modified: 30 Dec 2021, 15:00:48 UTC

The jobs that run forever and use very little CPU power ("0 CPU") occur only on Linux, as far as I have found.
They have been around for half the age of the universe, not that anyone at Rosetta is around to care.

As I mentioned elsewhere, they are easy to spot using BoincTasks. I just abort them. But they do not seem to be a problem on Windows.
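
For anyone without BoincTasks, the same abort can be issued from the command line with `boinccmd --task <project URL> <task name> abort`. A hedged Python sketch that only builds (and optionally runs) that command follows; the task name shown is a made-up placeholder, and the project URL is assumed to be the one your BOINC client lists for Rosetta@home.

```python
import subprocess

# Assumed project URL; use whatever URL your own client shows for the project.
PROJECT_URL = "https://boinc.bakerlab.org/rosetta/"


def abort_command(task_name, project_url=PROJECT_URL):
    """Build the boinccmd argument list that aborts one task."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]


def abort_task(task_name, project_url=PROJECT_URL, dry_run=True):
    """Print the command; actually run it only when dry_run is False."""
    command = abort_command(task_name, project_url)
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)
    return command


# Hypothetical task name, for illustration only; nothing is executed here.
abort_task("example_stuck_task_0")
```

The default is a dry run so that nothing is aborted by accident. Depending on the setup, boinccmd may need to be run from the BOINC data directory or be given the GUI RPC password.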

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103883#103883
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103823#103823
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103689#103689
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103659#103659
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103493#103493
ID: 103950
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 103951 - Posted: 30 Dec 2021, 21:34:49 UTC - in response to Message 103948.  

Actually, it looks like the problem might be disk access. I've just had a look at Task Manager on that machine, which is showing that the SSD (120GB Kingston A400) is at 100%. It's only using 6.2GB of 16GB RAM, so I'd be surprised if it's smashing the page file. Stopping BOINC drops disk access to ~0%, and stopping other BOINC projects helped briefly but drive usage is back at 100%.

Having aborted a batch of failed VBox tasks, there were a load of new tasks starting up. I presume that start-up requires a lot of disk activity and they're all fighting for it at the same time.

EDIT: The disk was full. Windows finally popped a notice up to tell me. I've ordered a new SSD to put BOINC on. The problem is the huge size of these VBox tasks. If one VBox could run multiple threads/tasks then that might save a lot of disk space, assuming they're working from the same dataset.


I run LHC and this on a 24-core machine. When this project started using VBox as well, I had to move BOINC to the rotary drive. I can't afford an SSD that big.
ID: 103951
Charles Tomaras
Joined: 18 Aug 09
Posts: 11
Credit: 26,215,730
RAC: 18,853
Message 103952 - Posted: 30 Dec 2021, 23:47:33 UTC

I haven't gotten any work units in at least a week now. I've tried resetting the project. I've now got other stuff running instead of Rosetta. I see no news that it's been down. Anything else I can do to figure out why I'm not receiving work units?
ID: 103952
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,860,059
RAC: 3,073
Message 103953 - Posted: 30 Dec 2021, 23:59:30 UTC
Last modified: 31 Dec 2021, 0:00:11 UTC

Is anyone getting any work? I'm not picking up any python tasks at the moment.

I see I'm not the only one! I've been getting work most of the week until now, but the server status shows there should be work available.
ID: 103953


©2024 University of Washington
https://www.bakerlab.org