Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 80621 - Posted: 10 Sep 2016, 21:46:53 UTC

Please report any issues with work units in this thread.
Rosetta Moderator: Mod.Sense
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 80622 - Posted: 10 Sep 2016, 21:47:57 UTC

Link to older technical issues thread.
Sid Celery

Joined: 11 Feb 08
Posts: 2136
Credit: 41,518,559
RAC: 15,775
Message 80626 - Posted: 11 Sep 2016, 3:41:27 UTC

I'll do a full boinc server upgrade when we get our hardware.

Excellent. Thanks.
Profile Diplomat
Joined: 2 Aug 10
Posts: 6
Credit: 13,830,152
RAC: 3,055
Message 80627 - Posted: 11 Sep 2016, 17:44:29 UTC

The granted credit value has become significantly lower than it used to be.

At the same time, the tasks table now gets cleared almost instantly, so it's hard to prove my point (I have 24 hours of computing).

The only result still visible (and while I was writing this message, it disappeared from the table too):

Task 873957277 · Work unit 782555353 · Sent 10 Sep 2016 13:02:18 UTC · Reported 11 Sep 2016 17:27:12 UTC · Server state: Over · Outcome: Success · Client state: Done · CPU time: 85,467.54 s · Claimed credit: 1,032.09 · Granted credit: 691.97

The others I did manage to spot had the same CPU time, with claimed credit slightly over 1,000 and granted credit around 550-650.

Before, I would have expected granted values between 850 and 1,000.
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 80628 - Posted: 11 Sep 2016, 19:30:38 UTC

We are trying to reduce the size of the database and the load on the database server, so we have temporarily and dramatically shortened the time workunits/results remain in our database. Sorry for any inconvenience; we expect to go back to normal time spans soon after we catch up on the large backlog.

Please also do not trust the stats being displayed. I also turned off the stat updates, since the db queries were putting a large load on our server and slowing things down too much.

I expect it to take a week or so for things to settle back to normal. We will also upgrade the database server to a temporary server with much more disk space, faster disks, and double the memory, which should hopefully help. The upgrade will result in another day of downtime while the data gets transferred, etc. This will happen sometime next week.
SPKA67213

Joined: 12 Jan 06
Posts: 6
Credit: 20,072,362
RAC: 1,831
Message 80629 - Posted: 12 Sep 2016, 0:38:58 UTC - in response to Message 80628.  

I haven't been able to receive new work units for the last couple of weeks.

After reading some of the Q&A, I'm still not sure what is broken, why, or when it will be fixed. I'd love to contribute my CPU cycles, but with nothing to crunch from R@H I need to move to a different project.

It would be nice if the project home page had a headline that says "yeah, we know it's broken and we're working to fix it. LINKY HERE." YMMV and all, but it just seemed a little silly that a casual contributor (4 computers, 12 CPU cores) has to dig around to find out there's likely nothing wrong on his end.

See you next year, when R@H *might* be fixed.
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 80630 - Posted: 12 Sep 2016, 4:41:41 UTC - in response to Message 80629.  

I'm not sure why you aren't getting work units. The system seems ok now and clients should be getting jobs. My desktops are crunching and were able to get jobs recently. Can you try to detach and reattach and see if that helps?

Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 80631 - Posted: 12 Sep 2016, 7:57:02 UTC - in response to Message 80629.  

Are you running the Mac client? It has been more broken than the others for a while, and especially bad in the last few days: my Mac has just run out of work again but has about 15 completed work units waiting to be reported. Per my other thread, I've confirmed that the Windows 10 and Ubuntu versions are running normally (after the recent rounds of flakiness), but the OS X version remains sick.

#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Sid Celery

Joined: 11 Feb 08
Posts: 2136
Credit: 41,518,559
RAC: 15,775
Message 80638 - Posted: 14 Sep 2016, 10:38:41 UTC - in response to Message 80628.  

A pity. My awarded credits are much higher relative to claimed credits than they used to be.

Good statement on the front page.
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 80642 - Posted: 14 Sep 2016, 17:33:44 UTC - in response to Message 80638.  

The credit-granting stats should be up to date. The db server is working well now and seems to be caught up. I'm also going to gradually increase the work history timespan. We'll have a period of downtime soon to upgrade the database server.
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 80643 - Posted: 14 Sep 2016, 17:34:50 UTC

My Mac is running well. Are others having Mac client issues?
Sid Celery

Joined: 11 Feb 08
Posts: 2136
Credit: 41,518,559
RAC: 15,775
Message 80652 - Posted: 14 Sep 2016, 21:59:17 UTC - in response to Message 80642.  
Last modified: 14 Sep 2016, 22:00:45 UTC

All good here, thanks.

I should mention, though, that all the 10_s??_PH160908 tasks would run a whole lot longer if they didn't hit the 100-decoy limit and shut down within 2 hours. My average run time is so short that I've got over 100 tasks waiting to run now.

That's great for us users but seems counterproductive for the server at this time. Maybe don't increase that history timespan just yet.
sinspin

Joined: 30 Jan 06
Posts: 29
Credit: 6,574,585
RAC: 0
Message 80653 - Posted: 14 Sep 2016, 22:42:37 UTC

I've had a lot of validate errors (15), covering almost all of my last work batch.

Here are 3 of them:
https://boinc.bakerlab.org/rosetta/result.php?resultid=873772221
https://boinc.bakerlab.org/rosetta/result.php?resultid=873771928
https://boinc.bakerlab.org/rosetta/result.php?resultid=873771868

Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 80654 - Posted: 15 Sep 2016, 8:13:29 UTC - in response to Message 80652.  

Yes, very observant! I'm planning to fix this 100 model hard limit issue for the next app update. I think there's a command line option also that I'll look into for new jobs in the near future.
Profile David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 80655 - Posted: 15 Sep 2016, 8:18:25 UTC - in response to Message 80653.  

Hmm... I'll take a look at these in detail.
Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 80656 - Posted: 15 Sep 2016, 19:33:09 UTC - in response to Message 80654.  

Interesting stuff about the 10-prefix units, and there is also a notice on the top webpage. None of that seems to explain the special or excess problems the Mac client is suffering from. The only (superficial) thing I've noticed is that the "transitioner" server is often down. Is that required only by the Mac client? (My Windows 10 and Ubuntu Linux machines are mostly running okay, but the Mac has another pile of unreported work units and will soon run out of work again.)

Some problem with the Preview function here?
RosettaMac

Joined: 5 Dec 13
Posts: 11
Credit: 1,639,488
RAC: 0
Message 80657 - Posted: 15 Sep 2016, 20:59:48 UTC - in response to Message 80643.  

My iMac hasn't received any work for several days. I've tried manual Updates and Reset Project from the BOINC Manager, but still get no new work.
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,888,320
RAC: 0
Message 80660 - Posted: 16 Sep 2016, 19:36:48 UTC - in response to Message 80643.  
Last modified: 16 Sep 2016, 19:43:50 UTC

I'm not having any Mac-specific issues; the only times I haven't gotten new work seem to coincide with the same problems that affected everyone else.

I do have concerns about an "acourbet.10.design_S" unit that is claiming large amounts of working memory (3.08GB right now, down from 3.2GB). It's been running for about 4.5 hours, and the last checkpoint was about 2.5 hours ago. It's an older machine with only 4GB of memory, but as it's not doing anything else at the moment, BOINC is allowed to use all of it. I'll keep an eye on it but thought a heads-up might be in order.

Edited to add: this is a resend after the previous cruncher's effort ended with an "out of memory" error.
workunit


Best,
Snags
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 2001
Credit: 9,780,807
RAC: 8,163
Message 80666 - Posted: 22 Sep 2016, 12:51:39 UTC

Servers are down again....
Any news about this:
An interim solution will be to temporarily upgrade our database server and thanks to our sys admins, we already have a machine ready to go that has plenty of disk space and double the memory. The upgrade will require a day of downtime which is planned to happen early this week.


Are we on the temporary server?
Profile JonPer

Joined: 4 May 06
Posts: 14
Credit: 510,105
RAC: 0
Message 80667 - Posted: 22 Sep 2016, 17:34:06 UTC - in response to Message 80660.  

No new work packages since the 15th; it is getting cold in here!




©2024 University of Washington
https://www.bakerlab.org