Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Previous · 1 . . . 149 · 150 · 151 · 152 · 153 · 154 · 155 . . . 235 · Next

dcdc
Joined: 3 Nov 05
Posts: 1828
Credit: 107,039,762
RAC: 9,282
Message 103947 - Posted: 30 Dec 2021, 12:34:10 UTC - in response to Message 103946.  

Same here: I just found 5 tasks that have all been stuck at 99.999% after 3-4 days each. They are aaai, aaad, and abai tasks. I've tried suspending them and then letting them run again, but that doesn't help, so I'm going to abort them now.

Does anyone have any idea why this happens? It happens on some machines much more than others; this one, a dual Sandy Bridge Xeon, is my worst offender:

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3632346
ID: 103947
dcdc
Joined: 3 Nov 05
Posts: 1828
Credit: 107,039,762
RAC: 9,282
Message 103948 - Posted: 30 Dec 2021, 12:55:02 UTC
Last modified: 30 Dec 2021, 13:07:19 UTC

Actually, it looks like the problem might be disk access. I've just had a look at Task Manager on that machine, which is showing that the SSD (120GB Kingston A400) is at 100%. It's only using 6.2GB of 16GB RAM, so I'd be surprised if it's smashing the page file. Stopping BOINC drops disk access to ~0%, and stopping other BOINC projects helped briefly but drive usage is back at 100%.

After I aborted a batch of failed VBox tasks, a load of new tasks started up. I presume that start-up requires a lot of disk activity and they're all fighting for it at the same time.

EDIT: The disk was full. Windows finally popped up a notice to tell me. I've ordered a new SSD to put BOINC on. The problem is the huge size of these VBox tasks. If one VBox instance could run multiple threads/tasks, that might save a lot of disk space, assuming they're working from the same dataset.
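A rough way to see how close a drive is to this problem is to compare its free space with the per-task requirement. The Python sketch below is a hypothetical helper, not part of BOINC or Rosetta: it uses the standard-library `shutil.disk_usage` call, treats 1 MB as 1024 × 1024 bytes, and assumes each Python/VBox task needs the 19073.49 MB quoted in the scheduler messages further down this thread. The data-directory path is also an assumption (the Ubuntu package uses `/var/lib/boinc-client`; Windows usually uses `C:\ProgramData\BOINC`).

```python
import os
import shutil

BYTES_PER_MB = 1024 * 1024
# Assumption: per-task requirement taken from the scheduler message
# "...and it needs 19073.49 MB." quoted later in this thread.
PER_TASK_MB = 19073.49


def tasks_that_fit(free_bytes: int, per_task_mb: float = PER_TASK_MB) -> int:
    """Estimate how many more tasks of per_task_mb would fit in free_bytes."""
    if per_task_mb <= 0:
        raise ValueError("per_task_mb must be positive")
    return max(0, int((free_bytes / BYTES_PER_MB) // per_task_mb))


def report(data_dir: str) -> str:
    """Summarise free space on the drive holding data_dir."""
    usage = shutil.disk_usage(data_dir)  # (total, used, free), all in bytes
    free_mb = usage.free / BYTES_PER_MB
    return (f"{free_mb:.2f} MB free on {data_dir}; "
            f"room for about {tasks_that_fit(usage.free)} more task(s)")


if __name__ == "__main__":
    # Hypothetical default: the Ubuntu boinc-client package's data directory.
    path = "/var/lib/boinc-client"
    print(report(path if os.path.isdir(path) else "."))
```

With the ~13780 MB that the log further down reports as available, `tasks_that_fit` returns 0, which is in line with the scheduler refusing to send more work.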
ID: 103948
gbayler
Joined: 10 Apr 20
Posts: 14
Credit: 3,069,484
RAC: 0
Message 103949 - Posted: 30 Dec 2021, 14:29:53 UTC

@dcdc: Thank you for your answer!

In my case, there are ~14 GB free on the disk. That's too little to get additional tasks; I can see entries like this in the syslog:
Dec 30 14:57:40 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:40 [Rosetta@home] Sending scheduler request: To fetch work.
Dec 30 14:57:40 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:40 [Rosetta@home] Requesting new tasks for CPU
Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] Scheduler request completed: got 0 new tasks
Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] No tasks sent
Dec 30 14:57:42 i5-be-quiet boinc[2340611]: 30-Dec-2021 14:57:42 [Rosetta@home] rosetta python projects needs 5292.79MB more disk space.  You currently have 13780.69 MB available and it needs 19073.49 MB.

Not sure whether this interferes with the running tasks. In addition to the 3 problematic tasks there are 2 other tasks (also VBox tasks) on this machine that seem to run normally.

I'm using Ubuntu 21.10 on an i5-8400, if that makes a difference.

The system has now created another task for the workunit that wasn't finished in time. I'm curious whether the next computer processing this WU will experience the same problems!
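The three figures in the last log line above are related by simple subtraction: 19073.49 - 13780.69 = 5292.80 MB, which matches the reported 5292.79 MB shortfall to within rounding. A hypothetical Python helper to pull those numbers out of a syslog or event-log line and check that relationship might look like this; the regular expression is written against the message text quoted above and is an assumption about its exact wording in other client versions.

```python
import re

# Written against the client message quoted above; the exact wording is an
# assumption and may differ between BOINC client versions.
DISK_MSG = re.compile(
    r"needs (?P<more>[\d.]+)\s*MB more disk space\.\s+"
    r"You currently have (?P<have>[\d.]+)\s*MB available "
    r"and it needs (?P<need>[\d.]+)\s*MB\."
)


def parse_disk_message(line: str):
    """Return (more, have, need) in MB, or None if the line doesn't match."""
    m = DISK_MSG.search(line)
    if m is None:
        return None
    return float(m["more"]), float(m["have"]), float(m["need"])


def consistent(more: float, have: float, need: float, tol: float = 0.02) -> bool:
    """True if the reported shortfall equals need - have, allowing for rounding."""
    return abs((need - have) - more) <= tol
```

Applied to the line above, `parse_disk_message` returns `(5292.79, 13780.69, 19073.49)` and `consistent` returns `True`: the "needs X more" figure is just the full requirement minus what the client reports as available.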
ID: 103949
Jim1348
Joined: 19 Jan 06
Posts: 877
Credit: 51,526,729
RAC: 1,646
Message 103950 - Posted: 30 Dec 2021, 14:57:03 UTC
Last modified: 30 Dec 2021, 15:00:48 UTC

The jobs that run forever and use very little CPU ("0 CPU") have only shown up on Linux, as far as I have found.
They have been around since half the age of the universe, not that anyone at Rosetta is around to care.

As I mentioned elsewhere, they are easy to spot using BoincTasks. I just abort them. But they do not seem to be a problem on Windows.

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103883#103883
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103823#103823
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103689#103689
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103659#103659
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=6893&postid=103493#103493
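Tasks like these can also be flagged from the command line. The sketch below is a hypothetical Python helper, not something Rosetta or BOINC ships: it parses text in the format printed by `boinccmd --get_tasks` and flags tasks whose CPU time is tiny compared with their elapsed time (the "0 CPU" symptom), or that have sat at 99.9% or more for over a day (like the 99.999% tasks reported at the top of this page). The field names (`name`, `fraction done`, `current CPU time`, `elapsed task time`) are assumed from recent clients, and all thresholds are arbitrary; check both against your own client's output.

```python
from __future__ import annotations

import re

# Arbitrary thresholds (assumptions): tune for your own machines.
MIN_ELAPSED_S = 3600.0          # ignore tasks that have run for under an hour
MIN_CPU_RATIO = 0.05            # CPU seconds per elapsed second
STUCK_FRACTION = 0.999          # "almost finished"...
STUCK_ELAPSED_S = 24 * 3600.0   # ...but still running after a day

TASK_HEADER = re.compile(r"^\d+\) -+$")  # e.g. "1) -----------"


def parse_tasks(text: str) -> list[dict[str, str]]:
    """Split boinccmd --get_tasks style output into one dict per task."""
    tasks: list[dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if TASK_HEADER.match(line):
            tasks.append({})
        elif tasks and ":" in line:
            key, _, value = line.partition(":")  # split at the first colon only
            tasks[-1][key.strip()] = value.strip()
    return tasks


def why_suspicious(task: dict[str, str]) -> str | None:
    """Return a reason if the task looks stalled, otherwise None."""
    elapsed = float(task.get("elapsed task time", 0))
    cpu = float(task.get("current CPU time", 0))
    done = float(task.get("fraction done", 0))
    if elapsed < MIN_ELAPSED_S:  # also guarantees elapsed > 0 below
        return None
    if cpu / elapsed < MIN_CPU_RATIO:
        return "near-zero CPU use"
    if done >= STUCK_FRACTION and elapsed > STUCK_ELAPSED_S:
        return "stuck near 100%"
    return None


def suspicious_tasks(text: str) -> list[tuple[str, str]]:
    """(task name, reason) for every task that looks stalled."""
    flagged = []
    for task in parse_tasks(text):
        reason = why_suspicious(task)
        if reason:
            flagged.append((task.get("name", "?"), reason))
    return flagged
```

In practice you would feed it the output of `boinccmd --get_tasks` (for example via `subprocess.run(["boinccmd", "--get_tasks"], capture_output=True, text=True).stdout`) and then decide by hand which flagged tasks to abort.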
ID: 103950
Peter Hucker of the Scottish Boinc Team
Joined: 12 Aug 06
Posts: 1304
Credit: 5,974,949
RAC: 12,092
Message 103951 - Posted: 30 Dec 2021, 21:34:49 UTC - in response to Message 103948.  

I run LHC and this project on a 24-core machine. When this project started using VBox as well, I had to move BOINC to the rotary drive. I can't afford an SSD that big.
ID: 103951
Charles Tomaras
Joined: 18 Aug 09
Posts: 11
Credit: 19,381,341
RAC: 2,082
Message 103952 - Posted: 30 Dec 2021, 23:47:33 UTC

I haven't gotten any work units in at least a week now. I've tried resetting the project. I've now got other stuff running instead of Rosetta. I see no news that it's been down. Anything else I can do to figure out why I'm not receiving work units?
ID: 103952
dcdc
Joined: 3 Nov 05
Posts: 1828
Credit: 107,039,762
RAC: 9,282
Message 103953 - Posted: 30 Dec 2021, 23:59:30 UTC
Last modified: 31 Dec 2021, 0:00:11 UTC

Is anyone getting any work? I'm not picking up any python tasks at the moment.

I see I'm not the only one! I've been getting work most of the week until now, but the server status shows there should be work available.
ID: 103953
robertmiles
Joined: 16 Jun 08
Posts: 1194
Credit: 13,232,640
RAC: 1,146
Message 103954 - Posted: 31 Dec 2021, 0:00:04 UTC - in response to Message 103952.  
Last modified: 31 Dec 2021, 0:04:54 UTC

Check whether you have virtualization enabled, and whether BOINC was installed with VirtualBox.

Some of this information appears near the start of the BOINC log file, if the client was started recently enough.

Also, check the server status at:

https://boinc.bakerlab.org/rosetta/server_status.php

The number of available tasks is currently rather low, and it's possible that all of these require different hardware than you have.
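On Linux, one quick way to check the first point is to look for the hardware-virtualization CPU flags in `/proc/cpuinfo`: `vmx` for Intel VT-x, `svm` for AMD-V. The sketch below is a hypothetical helper that does just that. Note that the flag shows what the CPU supports; depending on the kernel, it may still appear when virtualization is disabled in the BIOS/UEFI, so the VirtualBox lines near the start of the BOINC log remain the more direct evidence.

```python
from __future__ import annotations

from pathlib import Path

# CPU flags that advertise hardware virtualization support.
VIRT_FLAGS = {"vmx": "Intel VT-x", "svm": "AMD-V"}


def virt_support(cpuinfo_text: str) -> str | None:
    """Return the virtualization extension named in the 'flags' lines of
    /proc/cpuinfo-style text, or None if neither flag is present."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":  # skips e.g. "vmx flags" lines
            flags = set(value.split())
            for flag, name in VIRT_FLAGS.items():
                if flag in flags:
                    return name
    return None


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(virt_support(cpuinfo.read_text()) or "no vmx/svm flag found")
    else:
        print("/proc/cpuinfo not found (not Linux?)")
```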
ID: 103954
dcdc
Joined: 3 Nov 05
Posts: 1828
Credit: 107,039,762
RAC: 9,282
Message 103955 - Posted: 31 Dec 2021, 1:08:28 UTC - in response to Message 103953.  

Mine was because those machines had all produced errors, so I had to open the details page for each one and hit "Allow".
ID: 103955
Jim1348
Joined: 19 Jan 06
Posts: 877
Credit: 51,526,729
RAC: 1,646
Message 103958 - Posted: 31 Dec 2021, 5:07:44 UTC - in response to Message 103950.  

I just had my first 0 CPU job on Win10, so I aborted it. Too bad.
I was going to convert another Ubuntu machine to Windows, but now I don't think I will. I will wait until they fix it.
ID: 103958
Greg_BE
Joined: 30 May 06
Posts: 5556
Credit: 5,554,976
RAC: 55
Message 103960 - Posted: 31 Dec 2021, 9:44:19 UTC - in response to Message 103953.  

I got hit with the same thing.
A load of BS if you ask me.
Any errors are on the RAH team.
I run other VM projects with no problems.
ID: 103960
Peter Hucker of the Scottish Boinc Team
Joined: 12 Aug 06
Posts: 1304
Credit: 5,974,949
RAC: 12,092
Message 103964 - Posted: 31 Dec 2021, 15:51:05 UTC - in response to Message 103958.  

I run them in about 4 to 7 hours on my Ryzen 9 3900XT. But one of them got up to 2 days and slowed right down past 99% complete. I aborted it.

I've given up running them on slower machines, I just use the Ryzen and the i5. The others keep going over the deadline.
ID: 103964
.clair.
Joined: 2 Jan 07
Posts: 185
Credit: 23,191,784
RAC: 1,146
Message 103965 - Posted: 31 Dec 2021, 22:34:33 UTC - in response to Message 103949.  

I have been getting the `disk space` messages for some time, even with 200GB "free available to boinc".
I have spent a lot of time messing about trying to fix it, with no effect.
That "19073.49MB" figure is always the same on both systems I run, whatever the other `want more` / `have got` figure (MB) is.
I think it has to be something written into the app itself.
I also have difficulty getting more tasks if the `disk space` message has appeared recently in the event log.
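The fixed figure has a tidy explanation if the client converts bytes to MB by dividing by 1024 × 1024: 20,000,000,000 bytes / 1,048,576 = 19073.486..., which rounds to exactly the 19073.49 MB in the message. That is consistent with the Python tasks carrying a fixed disk requirement of 20 GB (decimal) set on the project side, though this is an inference from the arithmetic, not something the project has confirmed. A quick check, under those stated assumptions:

```python
# Assumptions (not confirmed by the project): the client reports "MB" as
# bytes / (1024 * 1024), and the app asks for a round 20 GB (decimal) per task.
BYTES_PER_MB = 1024 * 1024
ASSUMED_DISK_BOUND_BYTES = 20_000_000_000


def as_boinc_mb(n_bytes: int) -> float:
    """Convert a byte count to MB (2**20 bytes), rounded to 2 decimal places."""
    return round(n_bytes / BYTES_PER_MB, 2)


print(as_boinc_mb(ASSUMED_DISK_BOUND_BYTES))  # 19073.49
```

If the assumption holds, the "it needs" figure will stay at 19073.49 MB no matter how much space is free; only the "currently have" and "needs ... more" numbers change from host to host.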
ID: 103965
Greg_BE
Joined: 30 May 06
Posts: 5556
Credit: 5,554,976
RAC: 55
Message 103968 - Posted: 1 Jan 2022, 0:24:30 UTC - in response to Message 103964.  

My Ryzen 7 3700X (auto overclock) takes 4.8 hours at most to chew through a Python task.
ID: 103968
Pepino65
Joined: 30 Jan 12
Posts: 1
Credit: 717,255
RAC: 285
Message 103972 - Posted: 1 Jan 2022, 9:40:47 UTC

The estimated time of 8 hours was completely unrealistic. I aborted 3 of my 6 WUs. Checkpoints are missing, and I am not happy with using VirtualBox. The units below are running for the second time: the first time, after 4 hours and a subsequent restart, the crunching started again from the beginning.

ID: 103972
Peter Hucker of the Scottish Boinc Team
Joined: 12 Aug 06
Posts: 1304
Credit: 5,974,949
RAC: 12,092
Message 103974 - Posted: 1 Jan 2022, 11:42:05 UTC

How come my i5 is running them continuously, but when my Ryzen asks for them it says it got no new tasks? There seems to be a continuous supply according to the server status.
ID: 103974
dcdc
Joined: 3 Nov 05
Posts: 1828
Credit: 107,039,762
RAC: 9,282
Message 103977 - Posted: 1 Jan 2022, 12:53:12 UTC - in response to Message 103974.  

Have you checked on the Ryzen computer's page to see whether the "allow" button is showing? It should say "skip".
ID: 103977

©2022 University of Washington
https://www.bakerlab.org