Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Sid Celery
Joined: 11 Feb 08
Posts: 2474
Credit: 46,506,558
RAC: 3,757
Message 112263 - Posted: 20 Mar 2025, 1:35:16 UTC - in response to Message 112262.  

Check the home page. As of 1900 UTC today, there are over 10 million queued tasks.

I meant Work in Progress, sorry. Currently 198k and just 1k unsent
ID: 112263 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 2474
Credit: 46,506,558
RAC: 3,757
Message 112264 - Posted: 20 Mar 2025, 1:39:24 UTC - in response to Message 112234.  

It's currently 4:30pm on Friday afternoon, their time.
It's not looking good ahead of the weekend.

No sooner do I write that and everything's now showing green on the Server page

From now on, would you please make that post on Wednesday afternoon.

If only I could rely on that.
I'm quite prepared to try it as early as possible each Friday - like at 00:05

Well, a very strange thing has happened.
boinc-process server is back running and the validation backlog is fully cleared already.
We've barely reached Thursday morning UK time.
I'm slightly disconcerted now tbh. Everything I thought I knew has been turned upside down.
I shouldn't complain, but it's part of my character by now...
ID: 112264 · Rating: 0

mzelden
Joined: 8 Sep 20
Posts: 3
Credit: 4,095,105
RAC: 4
Message 112267 - Posted: 20 Mar 2025, 3:32:49 UTC - in response to Message 112266.  

Where are those error messages being shown?
Other than what appears to be a heavily loaded system (11.5 hours to do 8 hours work, 4 hrs 15 min to do 3 hrs work), other than the 2 errored Tasks (due to a configuration issue with the Tasks themselves), all the others have processed & Validated without issue.


Seems the message of the screensaver...



I get this all the time too. I typically see it in the morning after not working on the computer. I will say it started after a MBO change. I've tried uninstalling and reinstalling BOINC a couple of times and it hasn't helped. I haven't tried resetting the Rosetta project. If I just leave the computer idle for the screensaver timeout, it seems to launch normally and I see the graphics.

BTW, I still had SETI@HOME in my config, but deleted that completely and it didn't help. Rosetta@home is the only project. If I completely delete it and add it again, I don't lose any credits, correct? I know I have to update first...
ID: 112267 · Rating: 0

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1895
Credit: 18,534,891
RAC: 0
Message 112268 - Posted: 20 Mar 2025, 5:36:28 UTC - in response to Message 112261.  

Things are still very much broken- the amount of work In progress continues to fall, as does the number of Tasks processed per 24hrs.
Heaps of work available, but no one (or pretty much no one) can get any of it.

I'm not disagreeing, but where are you seeing that?
On the main page.
The Grafana graphs make it even easier to see what's going on.

Whatever was broken appears to be working again- work In progress is climbing, the amount of work being returned each 24hrs is also increasing again.
Grant
Darwin NT
ID: 112268 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 2474
Credit: 46,506,558
RAC: 3,757
Message 112270 - Posted: 21 Mar 2025, 0:16:43 UTC - in response to Message 112268.  

Things are still very much broken- the amount of work In progress continues to fall, as does the number of Tasks processed per 24hrs.
Heaps of work available, but no one (or pretty much no one) can get any of it.

I'm not disagreeing, but where are you seeing that?
On the main page.
The Grafana graphs make it even easier to see what's going on.

Whatever was broken appears to be working again- work In progress is climbing, the amount of work being returned each 24hrs is also increasing again.

Excellent link, thanks. I used to use a different page but forgot what it was.
I can see the WiP figure has risen slightly, but it still seems 20-30k below what it was some weeks ago and there seem to be so few unsent most of the time.
Still, I'm running down WCG and SiDock on all my PCs and managing to maximise my small cache on each too, if more slowly than usual.
ID: 112270 · Rating: 0

Random
Joined: 10 Mar 24
Posts: 8
Credit: 115,388
RAC: 0
Message 112271 - Posted: 21 Mar 2025, 1:44:15 UTC

How long does it take some of you super computers to complete 1 wu? Just curious.
ID: 112271 · Rating: 0

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1895
Credit: 18,534,891
RAC: 0
Message 112272 - Posted: 21 Mar 2025, 4:34:31 UTC - in response to Message 112271.  

How long does it take some of you super computers to complete 1 wu? Just curious.
Rosetta Tasks are different to other projects- they are set to run for a certain amount of time- 8 hours for Rosetta 4.20 Tasks, and 3 Hours for Rosetta Beta Tasks (although there are some batches where they are set to run for 8 hours). So a more powerful computer won't do more Tasks per day than a less powerful one- however the more powerful computer will do more processing in that time, and so it gets more Credit for each Task for doing the extra work.
Grant
Darwin NT
ID: 112272 · Rating: 0

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2124
Credit: 12,426,657
RAC: 2,579
Message 112273 - Posted: 21 Mar 2025, 6:09:12 UTC - in response to Message 112271.  
Last modified: 21 Mar 2025, 6:09:38 UTC

How long does it take some of you super computers to complete 1 wu? Just curious.


As Grant said, the difference is the number of decoys (shown as "Model" in the screensaver) you can complete in a wu.
An old Core i3, for example, makes 50 decoys in 4 hrs in a wu.
The same wu, in 4 hrs, on a new Threadripper makes 400 decoys (the numbers are made up, purely as an example).

Your credit is based on the number of decoys.
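
To make the fixed-runtime idea a bit more concrete, here is a toy Python sketch. The decoy counts are the made-up ones from the example above (50 and 400 decoys in 4 hours), and the credit-per-decoy value is purely hypothetical; the real credit calculation on the server side may well differ.

```python
# Toy model: every task runs for the same fixed time, so a faster CPU
# completes more decoys per task rather than more tasks per day.

def decoys_per_task(decoys_per_hour: float, runtime_hours: float) -> int:
    """Whole decoys completed within a fixed runtime."""
    return int(decoys_per_hour * runtime_hours)


def credit_for_task(decoys: int, credit_per_decoy: float) -> float:
    """Hypothetical rule: credit scales linearly with completed decoys."""
    return decoys * credit_per_decoy


RUNTIME_HOURS = 4        # same runtime on both machines
CREDIT_PER_DECOY = 0.5   # assumed value, for illustration only

old_i3 = decoys_per_task(50 / 4, RUNTIME_HOURS)           # 12.5 decoys/hour
threadripper = decoys_per_task(400 / 4, RUNTIME_HOURS)    # 100 decoys/hour

for name, decoys in [("old Core i3", old_i3), ("Threadripper", threadripper)]:
    credit = credit_for_task(decoys, CREDIT_PER_DECOY)
    print(f"{name}: {decoys} decoys in {RUNTIME_HOURS} hrs, {credit} credit")
```

Both machines finish the task in the same 4 hrs; only the decoy count, and therefore the credit, changes.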
ID: 112273 · Rating: 0

Tom M
Joined: 20 Jun 17
Posts: 178
Credit: 36,299,045
RAC: 19
Message 112274 - Posted: 22 Mar 2025, 17:33:01 UTC

Apparently the graphics on version 6.06 beta are now not working. Previously they were, or at least a previous version of the beta app was working.

On Linux.
Proud member of the O.F.A. (Old Farts Association)
ID: 112274 · Rating: 0

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1895
Credit: 18,534,891
RAC: 0
Message 112275 - Posted: 22 Mar 2025, 21:54:35 UTC
Last modified: 22 Mar 2025, 22:00:07 UTC

10 million jobs Queued up, but 0 ready to send.
Once again, the number of Tasks In progress is dropping away as work is returned but people can't get new work to do. More project server system issues to be sorted out.
Been an issue for about 8 hours now.


There have been issues on and off with the Assimilators not keeping up with the load for the last couple of days- they're on the bwsrv1 host.
Also on the bwsrv1 host is the Feeder (responsible for supplying new work)- so it looks like that's the server that's got problems.
Grant
Darwin NT
ID: 112275 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 2474
Credit: 46,506,558
RAC: 3,757
Message 112276 - Posted: 23 Mar 2025, 0:34:09 UTC - in response to Message 112275.  

10 million jobs Queued up, but 0 ready to send.
Once again, the number of Tasks In progress is dropping away as work is returned but people can't get new work to do. More project server system issues to be sorted out.
Been an issue for about 8 hours now.


There have been issues on and off with the Assimilators not keeping up with the load for the last couple of days- they're on the bwsrv1 host.
Also on the bwsrv1 host is the Feeder (responsible for supplying new work)- so it looks like that's the server that's got problems.

Yup, I was suspicious of bwsrv1 during last week.
Not that it's gone down - it's never been flagged as not running and half my calls for work are still successful - but it's as if it's running at half speed or even less.
It takes long enough to fix servers when they're definitely not running. I don't know what will trigger someone to even look at this as a problem.
Unless researchers start asking why their tasks aren't coming back as quickly as expected.
Back to crossing fingers...
ID: 112276 · Rating: 0

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1895
Credit: 18,534,891
RAC: 0
Message 112277 - Posted: 23 Mar 2025, 11:22:51 UTC - in response to Message 112276.  

10 million jobs Queued up, but 0 ready to send.
Once again, the number of Tasks In progress is dropping away as work is returned but people can't get new work to do. More project server system issues to be sorted out.
Been an issue for about 8 hours now.
Still broken- Tasks In progress has dropped by over 50,000.
Grant
Darwin NT
ID: 112277 · Rating: 0

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1895
Credit: 18,534,891
RAC: 0
Message 112280 - Posted: 24 Mar 2025, 7:35:50 UTC

Things improved for a while there, but once again the Tasks In progress are falling away & the Assimilators have a backlog that is growing rapidly.
Grant
Darwin NT
ID: 112280 · Rating: 0

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1895
Credit: 18,534,891
RAC: 0
Message 112281 - Posted: 25 Mar 2025, 5:16:44 UTC

Things are still broken, but not as broken.
The Assimilator backlog isn't as bad as it was, and the rate of decline in the amount of work being done has slowed down- but it is still dropping.
It's now about half of what it was (over 205,000 down to 115,000 now).
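
As a quick check on "about half", using the two rounded figures quoted above (the first is "over 205,000", so treat the result as approximate):

```python
# Work in progress before and after the drop, as quoted in the post.
before = 205_000
after = 115_000

drop = before - after              # absolute fall in Tasks In progress
remaining = after / before         # fraction of the earlier level still running

print(f"{drop:,} fewer Tasks In progress; {remaining:.0%} of the earlier level remains")
```

So a little over half of the earlier workload is still in progress, and the rest has drained away.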


When there is little to no work, there are no problems getting what is available. Now there is a heap of work available, and it's almost impossible to get any.
Grant
Darwin NT
ID: 112281 · Rating: 0

Tom M
Joined: 20 Jun 17
Posts: 178
Credit: 36,299,045
RAC: 19
Message 112282 - Posted: 25 Mar 2025, 13:57:56 UTC - in response to Message 112281.  

Things are still broken, but not as broken.
The Assimilator backlog isn't as bad as it was, and the rate of decline in the amount of work being done has slowed down- but it is still dropping.
It's now about half of what it was (over 205,000 down to 115,000 now).


When there is little to no work, there are no problems getting what is available. Now there is a heap of work available, and it's almost impossible to get any.



Nothing in the Ready To Send at the moment.
Proud member of the O.F.A. (Old Farts Association)
ID: 112282 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 2474
Credit: 46,506,558
RAC: 3,757
Message 112283 - Posted: 25 Mar 2025, 18:18:18 UTC - in response to Message 112281.  

Things are still broken, but not as broken.
The Assimilator backlog isn't as bad as it was, and the rate of decline in the amount of work being done has slowed down- but it is still dropping.
It's now about half of what it was (over 205,000 down to 115,000 now).


When there is little to no work, there are no problems getting what is available. Now there is a heap of work available, and it's almost impossible to get any.

On the plus side, when boinc-process goes down tomorrow, the number of tasks awaiting validation will be much lower than we're used to seeing...

...I've uncrossed my fingers and started clutching straws to see if that's a better strategy
ID: 112283 · Rating: 0

jon b.
Joined: 27 Dec 09
Posts: 1
Credit: 24,178,068
RAC: 862
Message 112284 - Posted: 25 Mar 2025, 19:27:19 UTC

I currently have my buffer set to store at least 0.15 and up to 0.25 additional days of work, and I have not run out of tasks on any of my computers yet. Another plus to maintaining a bit of a client-side queue is that it can help reduce load on the server by reducing the number of requests. Of course it would be ideal if the servers could keep up with our demand!
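
For anyone who wants to try the same buffer values without clicking through the Manager, they can also go in a global_prefs_override.xml file in the BOINC data directory. The Python sketch below only builds the XML text. The function name is made up for illustration, and where the file should be saved depends on your install, so that part is left out. After saving it, the client has to re-read the file, for example with boinccmd --read_global_prefs_override.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(min_days: float, additional_days: float) -> str:
    """Return global_prefs_override.xml content for the two work-buffer settings.

    min_days        -> "store at least N days of work"
    additional_days -> "store up to an additional N days of work"
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "work_buf_min_days").text = str(min_days)
    ET.SubElement(root, "work_buf_additional_days").text = str(additional_days)
    return ET.tostring(root, encoding="unicode")


# The values from the post: at least 0.15 days, up to 0.25 additional days.
xml_text = build_prefs_override(0.15, 0.25)
print(xml_text)
```

Anything set in the override file takes precedence over the web preferences on that one machine, so it is easy to experiment per host.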

Looking back at the Grafana logs, it looks like the boinc-process thing has been happening regularly on Wednesdays for at least a year. There may be a scheduled task performed during the downtime, such as a DB backup. Tasks are still being generated and distributed while boinc-process is down, and are validated when it comes back up. More of an annoyance than an actual operational issue.

The real question is why the project team haven't shared any technical information with volunteers in so long. WCG had some issues a while back, and they did an excellent job of explaining the root cause of the problem and what they did to address it. From my personal experience, I know it can be difficult for researchers to find time for "public relations," but keeping volunteers/donors informed about the work they are contributing resources to shouldn't be neglected.
ID: 112284 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 2474
Credit: 46,506,558
RAC: 3,757
Message 112287 - Posted: 26 Mar 2025, 16:26:41 UTC - in response to Message 112284.  
Last modified: 26 Mar 2025, 16:31:43 UTC

I currently have my buffer set to store at least 0.15 and up to 0.25 additional days of work, and I have not run out of tasks on any of my computers yet. Another plus to maintaining a bit of a client-side queue is that it can help reduce load on the server by reducing the number of requests. Of course it would be ideal if the servers could keep up with our demand!

Looking back at the Grafana logs, it looks like the boinc-process thing has been happening regularly on Wednesdays for at least a year. There may be a scheduled task performed during the downtime, such as a DB backup. Tasks are still being generated and distributed while boinc-process is down, and are validated when it comes back up. More of an annoyance than an actual operational issue.

The real question is why the project team haven't shared any technical information with volunteers in so long. WCG had some issues a while back, and they did an excellent job of explaining the root cause of the problem and what they did to address it. From my personal experience, I know it can be difficult for researchers to find time for "public relations," but keeping volunteers/donors informed about the work they are contributing resources to shouldn't be neglected.

Pretty much agree with all that.
I'm just back from a few days in Portugal, where I took my laptop and finally set it up for Boinc and Rosetta only, with just a 0.1 plus 0.1 cache size and 50% CPUs to keep the heat generation down. I managed to grab sufficient tasks to keep it occupied, with none spare.
I've got my main PC with all non-Rosetta projects set to NNT and that's grabbed enough tasks to keep going too. My two other PCs are allowing non-Rosetta tasks to run atm, so they're both only running a couple of Rosetta at a time.

What I'd emphasise yet again is that those tasks that are only 3 hours (Rosetta Beta I think) should be set explicitly at 8hr runtimes rather than allowing the default to knock them down to 3hrs.
This will keep people running a lot longer and reduce the demand for new tasks, which at the moment only leads to running out of fresh ones.
I'm personally <convinced> that this 3hr runtime setting is a mistake, however long it's persisted for.
There's no downside whatsoever as a result of making this change, only an upside for everyone.

I'm sure everyone already knows that boinc-process is down again - being Wednesday. I'd estimate about 10hrs ago.
Crossing fingers that it may come back early again, like it did last week when it returned on Thursday am (UTC) rather than Friday
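
A small sanity check on the weekday, using only this post's timestamp and my rough 10hr estimate (so the result is approximate):

```python
from datetime import datetime, timedelta, timezone

# Timestamp of this post (UTC) and the rough "about 10hrs ago" estimate.
posted = datetime(2025, 3, 26, 16, 26, 41, tzinfo=timezone.utc)
estimated_down = posted - timedelta(hours=10)

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
print(WEEKDAYS[estimated_down.weekday()],
      estimated_down.strftime("%d %b %Y, %H:%M UTC"))
```

That puts the estimated start of the outage early on Wednesday morning UTC, in line with the usual pattern.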

Edit: weird thing, but the assimilation backlog seems to have cleared down to zero at about the same time as validators went down. No idea what that's about
ID: 112287 · Rating: 0

©2025 University of Washington
https://www.bakerlab.org