If You Don't Know Where to Put it, Post it here.

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 93915 - Posted: 8 Apr 2020, 21:21:33 UTC - in response to Message 93908.  

I don't want to sound overconfident that I have a perfect understanding of all of the relevant facts... but yes. To me, the symptoms you report, tasks running longer than the runtime preference and resetting to zero when you restart the machine (which implies they never completed their first model), all sound like different ways of describing the same problem. You noticed the progress resetting when your machine is shut down overnight. Others reported it as low credit for WUs, and when I look at those tasks they all seem to have been ended by the watchdog (just like yours) and don't seem to have completed their first model (just like yours). I don't see any unexplained symptoms that preclude these being the same root cause.

Unfortunately, given all of the work they have on the table assimilating the incoming COVID results, it sounds like it may be some time before they can work on a fix. I can't even guess whether that means weeks or months. So the suggestion is to let the machine explore some other BOINC projects for a while (you can just mark R@h as "no new tasks" and leave the project on the list), and add Ralph to the list of projects (use a low resource share since work is sparse); this ensures they get some i686 Linux hosts in their next round of testing.
Rosetta Moderator: Mod.Sense
Bryn Mawr

Joined: 26 Dec 18
Posts: 376
Credit: 10,746,719
RAC: 6,008
Message 93916 - Posted: 8 Apr 2020, 21:21:34 UTC - in response to Message 93896.  

Greetings,

OK. I just logged back in and 11 of my 12 tasks started over at ZERO! A few were getting fairly close to finishing when I shut down BOINC and logged into Windows 10.

For those here, including Grant, who think I might be blowing smoke about this, I have a little test for you to perform, and I bet that what happens when I log back in will happen to many of you.

It's really simple. Take note of where your tasks are in elapsed and remaining time. You don't need to be precise, just a mental note. Shut down BOINC, including the app(s), wait a few seconds, then restart BOINC. I'll bet $10 that some, if not all, of your tasks will restart at zero.

@Grant: My settings are damn near identical to yours.

I really would like to continue with Rosetta, but if something isn't done about the checkpoints... forget it. I can live with the tasks running in high priority; I'm just tired of the wasted work because no checkpoints are being set.

Have a great day! :)

Siran


I guarantee that it does not on my systems. During the course of today I've brought two of my systems down and restarted them, and all of the Rosetta tasks restarted more or less where they left off.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 93918 - Posted: 8 Apr 2020, 21:25:50 UTC - in response to Message 93916.  

A quick peek at Bryn Mawr's 3 hosts, which he says are not seeing the problem, and at a random completed task from each, shows they are all running the rosetta_4.12_x86_64-pc-linux-gnu application, not the i686 one.
Rosetta Moderator: Mod.Sense
Profile Siran d'Vel'nahr

Joined: 15 Nov 06
Posts: 72
Credit: 2,674,678
RAC: 0
Message 93919 - Posted: 8 Apr 2020, 21:30:12 UTC - in response to Message 93916.  

Hi Bryn,

I guarantee that it does not on my systems. During the course of today I've brought two of my systems down and restarted them, and all of the Rosetta tasks restarted more or less where they left off.

Good, then it looks like you are running a different app than I am. The i686 app, which is what my tasks are running on, seems to have an issue.

Thanks and have a great day! :)

Siran
CAPT Siran d'Vel'nahr XO
USS Vre'kasht NCC-33187

"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
Profile Siran d'Vel'nahr

Joined: 15 Nov 06
Posts: 72
Credit: 2,674,678
RAC: 0
Message 93920 - Posted: 8 Apr 2020, 21:33:07 UTC - in response to Message 93918.  

A quick peek at Bryn Mawr's 3 hosts, which he says are not seeing the problem, and at a random completed task from each, shows they are all running the rosetta_4.12_x86_64-pc-linux-gnu application, not the i686 one.

Hi Mod,

LOL! You posted just before I posted my reply to Bryn's post.

By the way, I already have Rosetta set to NNT. :)

Thanks and have a great day! :)

Siran
CAPT Siran d'Vel'nahr XO
USS Vre'kasht NCC-33187

"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
Bryn Mawr

Joined: 26 Dec 18
Posts: 376
Credit: 10,746,719
RAC: 6,008
Message 93921 - Posted: 8 Apr 2020, 21:39:13 UTC - in response to Message 93918.  

A quick peek at Bryn Mawr's 3 hosts, which he says are not seeing the problem, and at a random completed task from each, shows they are all running the rosetta_4.12_x86_64-pc-linux-gnu application, not the i686 one.


I’ve had quite a few of the i686 tasks; I guess there were none when I bounced the machines :-)
Sid Celery

Joined: 11 Feb 08
Posts: 2000
Credit: 38,577,290
RAC: 17,083
Message 94054 - Posted: 10 Apr 2020, 10:29:06 UTC

This topic has reminded me of things I wanted to say over the last few weeks.

Having COVID-19 tasks here has brought lots of completely new users to BOINC, and, almost coincidentally, a lot of <highly> experienced BOINC users from Seti, bringing with them some mega crunching power.
Together, the number of active users has gone through the roof, the number of tasks burned through has followed, and they've stretched the demands of the project: from 40k active hosts to 240k, with credits awarded going up 6 times. But it's also brought up some issues.

Seti is not Rosetta. Not by a long chalk.
A lot of the assumptions learned at Seti don't apply at Rosetta. Some are the opposite. And that's brought some conflict with it.

So when I read users saying 'Seti allowed longer deadlines, so Rosetta should extend its deadlines to suit users, not the project', that's not right.
And when I read users saying 'I've got a megacrunching behemoth and I should be completing tasks quicker, but I don't, so the tasks aren't running properly and Rosetta needs to sort it out', that's not right. (Your behemoth completes more models within the target runtime than slower machines do, so you get higher credit that way.)
And when users say 'I'm used to hoarding weeks of tasks to meet longer deadlines, but now I don't have enough' (to the exclusion of others when tasks are limited), that's not right. Rosetta actually does something with your results: it looks at those returned and can tweak the next batch of tasks to account for the previous iteration, so a quick turnaround of completed tasks is highly important.
And when I see a comment like 'I know Rosetta because I ran it in 2006, a year after the project launched, so in 2020 it should run the same', that's not right.

And if the graphics don't run on a certain platform, the credits are a bit skewed with a new program version, a certain batch crashes out, the validation fails, or the servers conk out, that's clearly wrong and it'll get sorted sooner or later, but it's not the priority.

And if it's too much for you to bear and you feel like you have to go off in a huff, almost as if satisfying you as a user is a higher priority than the current purpose of the project, instead of having a little patience, reporting the shortcomings so things can be revised properly next time round, and committing just as much to the next round of tasks as you did the last, then too bad.

You're not running Seti any more, with all the assumptions you learned to run there successfully. At Rosetta, some of those assumptions are different, so keep an eye out and cooperate with the needs of this project, not some whole other project with an entirely different goal, purpose, time-scale and set of priorities, and bring that wealth of expertise and experience along as well, and things will be better for everyone.

Yeah, condescending, I know. Save it.
Profile Siran d'Vel'nahr

Joined: 15 Nov 06
Posts: 72
Credit: 2,674,678
RAC: 0
Message 94227 - Posted: 12 Apr 2020, 10:39:39 UTC - in response to Message 94054.  

Yeah, condescending, I know. Save it.

Condescending is an understatement!

Much of what you said in this thread has been exaggerated by you or is outright false.

If you had read the thread, you would know that an issue has been discovered with the v4.12 i686 app and/or the corresponding tasks, and it could take weeks or months for a fix to show up.
CAPT Siran d'Vel'nahr XO
USS Vre'kasht NCC-33187

"Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath
Tom M

Joined: 20 Jun 17
Posts: 85
Credit: 5,411,032
RAC: 57,435
Message 94229 - Posted: 12 Apr 2020, 11:17:20 UTC - in response to Message 94054.  

This topic has reminded me of things I wanted to say over the last few weeks.


Dear Sid,
You are both on target and off target.

1) Many of the "mega-crunchers" that have joined here have systems with up to 14 GPUs. Some of them don't crunch CPU tasks at all.
1a) The original "spoofing" techniques were only useful for GPU tasks.
The highly productive 128-thread systems at S@H were in a minority, and only a few were even in the top 100 of the leaderboard.

2) I am not sure some of the bunkering tools from S@H even apply here. They were used to switch CPU tasks to the GPU, etc.

3) Other bunkering tools may not even work here. "Bunkering" was used for contests, for gaining higher RAC or totals at the end of the S@H project, or to let us keep processing while the S@H server was down. S@H went down every Tuesday for 3-12 hours.

4) The types of tasks are changed depending on the results. It is very encouraging to have that level of responsiveness to crunching results. We didn't have that at S@H.

5) There was speculation about what would happen when former S@H users landed on other projects. And yup, projects that a lot of "us" have landed on have reported stretched servers. Welcome to the server world of S@H. :)

6) You are right. There has been a learning curve on the part of the experienced BOINC users.

7) Most of the Seti orphans are running machines with 16 CPU threads or fewer. So for the most part you are getting the results of a LOT of 4c/6c/8c machines joining. These participants were NOT mega-crunchers.

8) How would YOU feel if you couldn't get your box to crunch tasks because they were erroring out over programming issues?

Sid, hope you feel better now that you have unloaded.

Respectfully
Tom M
(one of those "mega" crunchers)
( yes I am crunching 18 threads@8 hours a task on a 16c/32t box and have NOT run dry [yet])
Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel.....
Tom M

Joined: 20 Jun 17
Posts: 85
Credit: 5,411,032
RAC: 57,435
Message 94235 - Posted: 12 Apr 2020, 12:45:59 UTC - in response to Message 94229.  

I just took a look at the current list of "Mega Cruncher" systems on Rosetta@home.

In the top 20 I don't recognize a single "type" of system that was/is common at Seti@Home.

Instead I see a huge number of "very high thread count" systems (upwards of 192 cores/threads).

Another thing to note is that the top producers are not unanimous in running the default 8-hour setting.

Instead, a number of them are running tasks of 3 hours or less.

If 8-hour tasks really do stress the server(s) less, and I have every reason to believe they do, perhaps we should start "nagging" (in private, or set up a public shame list?) every top producer (say the top 100) not running at least the 8-hour default to start running it.

Presumably they would get the same credit scores but access the server(s) less often?

And given the volume they may be running, it should lower the load on the server(s) significantly.

Respectfully,
Tom M
Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel.....
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94246 - Posted: 12 Apr 2020, 14:40:16 UTC - in response to Message 94235.  

Credit is based on the number of completed models reported and validated, for the specific batch of tasks being run. Each batch accumulates an average credit claim per model and this is used to grant credit as completed tasks are reported back.

This means that regardless of numbers of cores or threads, or how fast your bus or memory access is, or how large your L3 cache, the number of credits granted per model will be nearly identical for all users reporting results to that batch of work.

It also means that faster CPUs (where "faster" means lower time to complete each R@h model), will complete more models per hour of runtime, and be granted more credit per hour of runtime.
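To put rough numbers on that, here is a minimal sketch (hypothetical Python, not the project's actual server code; the grant_credit name and the 25-credits-per-model batch average are invented for illustration):

# Hypothetical illustration of the per-model credit scheme described above.
# The batch's running average claim (25 credits per model) is an invented figure.
def grant_credit(batch_avg_claim_per_model: float, models_completed: int) -> float:
    # Credit for one reported task: models completed times the batch's
    # average credit claim per model.
    return batch_avg_claim_per_model * models_completed

slow_host = grant_credit(25.0, models_completed=4)   # 4 models in a run  -> 100 credits
fast_host = grant_credit(25.0, models_completed=12)  # 12 models in a run -> 300 credits
print(slow_host, fast_host)

The per-model rate is the same for everyone reporting to the batch; the only thing a faster host changes is how many models it gets through per hour.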

The models are no different in the last hour of running a WU than the first, regardless of your runtime preference.
Rosetta Moderator: Mod.Sense
chrislaf

Joined: 28 Mar 20
Posts: 3
Credit: 1,487,482
RAC: 0
Message 94249 - Posted: 12 Apr 2020, 15:19:19 UTC - in response to Message 94235.  

I just took a look at the current list of "Mega Cruncher" systems on Rosetta@home.


Can you share the link you are using to see this?
Profile Applejacks

Joined: 20 May 17
Posts: 3
Credit: 11,758,466
RAC: 0
Message 94250 - Posted: 12 Apr 2020, 15:53:52 UTC - in response to Message 94249.  

Try either of these:
https://boinc.bakerlab.org/rosetta/top_users.php

https://boinc.bakerlab.org/rosetta/top_hosts.php
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1494
Credit: 14,706,505
RAC: 15,631
Message 94282 - Posted: 12 Apr 2020, 23:41:52 UTC - in response to Message 94246.  

Credit is based on the number of completed models reported and validated, for the specific batch of tasks being run. Each batch accumulates an average credit claim per model and this is used to grant credit as completed tasks are reported back.

This means that regardless of numbers of cores or threads, or how fast your bus or memory access is, or how large your L3 cache, the number of credits granted per model will be nearly identical for all users reporting results to that batch of work.
And the Credit granted for different types of work should also be the same (or at least very close). That's the theory behind Credit New; unfortunately it doesn't work that way, as shown by the large differences in Credit granted between different work types and the huge differences in APR for a given application on different systems with the same CPU.
It is what it is.
Grant
Darwin NT
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94300 - Posted: 13 Apr 2020, 0:31:45 UTC - in response to Message 94282.  

The degree to which batches differ from one-another can be huge. The variation of like system types is more likely due to the mix of work processed, especially over the span that RAC encompasses, than to the credit system. You also need to consider more than just "same CPU". How fast is the memory access and how much memory is available? Are there other applications contending for cache? What are the rates of page faults and pageouts? How fast is the storage for the page file? These varied factors are how they came up with credit based on useful, validated work (i.e. models). It is platform agnostic. The credit system doesn't care if your machine has a huge L3 cache, what it cares about is how much work your huge L3 cache lets you complete. It doesn't care if you hyperthread or not, it simply looks at the completed work.
Rosetta Moderator: Mod.Sense
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1494
Credit: 14,706,505
RAC: 15,631
Message 94304 - Posted: 13 Apr 2020, 0:55:44 UTC - in response to Message 94300.  
Last modified: 13 Apr 2020, 0:57:45 UTC

The degree to which batches differ from one-another can be huge. The variation of like system types is more likely due to the mix of work processed, especially over the span that RAC encompasses, than to the credit system. You also need to consider more than just "same CPU". How fast is the memory access and how much memory is available? Are there other applications contending for cache? What are the rates of page faults and pageouts? How fast is the storage for the page file?
And those factors alone don't account for a factor of 4 (or more) difference in APRs. 20%, 30%, maybe 50%? Yeah. But 400%? No.


These varied factors are how they came up with credit based on useful, validated work (i.e. models). It is platform agnostic. The credit system doesn't care if your machine has a huge L3 cache, what it cares about is how much work your huge L3 cache lets you complete. It doesn't care if you hyperthread or not, it simply looks at the completed work.
Yep, and that shows just how broken Credit New is.
One of its stated goals was to address the problems of some of the earlier versions. It shouldn't make any difference what application, hardware, operating system or combination thereof processes the work. The very definition of the Cobblestone meant that for a given amount of work (e.g. 100 GFLOPs) you would get so many Credits, based on the performance of a reference system.
So regardless of whether you're processing on the slowest ARM system possible or the fastest-clocked x86 system around, and regardless of the OS or the application, the work done is the same* and so the Credit awarded for valid work should be the same.
But it's not.

It's been a point of contention with Credit New ever since it was implemented (and why some projects don't use it).
It is what it is.




* for a task of fixed FLOPs
In the case of Rosetta, where the number of FLOPs done isn't fixed but the runtime is, if those 2 disparate systems both ran the Task for the same time, the Credit awarded would be proportional to the amount of processing done.
If the faster system was 10,000 times faster than the slower system, it would get 10,000 times the Credit. 5 times faster, 5 times the Credit.
If the faster system ran the Task for 4 hours, it would get half the Credit it would get if it ran it for 8 hours. If it ran it for 16 hours, it would get twice the Credit it would get if it ran it for 8 hours, the Credit awarded being in proportion to the work done.
And for different tasks, while they require different types of processing (not all FLOPs have the same overheads), the Credit granted to each should be the same (or at least very close) for doing the same number of FLOPs. In theory.
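To put a worked figure on that proportionality, here is a minimal sketch (hypothetical Python; it assumes the commonly quoted BOINC definition of the Cobblestone, 200 credits per day of work on a 1 GFLOPS reference host, and the 100 GFLOPs figure simply reuses the example above):

# Hypothetical illustration of credit proportional to FLOPs done, per the
# commonly quoted Cobblestone definition (an assumption, not project code).
REF_FLOPS = 1e9           # reference host speed, FLOP/s
CREDITS_PER_DAY = 200.0   # credits for one day of work on the reference host
SECONDS_PER_DAY = 86400.0

def credit_for_work(flops_done: float) -> float:
    # Credit depends only on the FLOPs completed, not on which host did them.
    return CREDITS_PER_DAY * flops_done / (SECONDS_PER_DAY * REF_FLOPS)

print(credit_for_work(100e9))      # a fixed 100 GFLOPs task -> ~0.23 credits on any host
print(credit_for_work(5 * 100e9))  # 5x the FLOPs in the same runtime -> ~1.16 credits

With a fixed-FLOPs task the fast and slow host earn the same credit; with Rosetta's fixed runtime the faster host does 5 times the FLOPs in the same hours and so, by this definition, should claim 5 times the credit.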
Grant
Darwin NT
Sid Celery

Joined: 11 Feb 08
Posts: 2000
Credit: 38,577,290
RAC: 17,083
Message 94357 - Posted: 13 Apr 2020, 16:55:33 UTC - in response to Message 94229.  

This topic has reminded me of things I wanted to say over the last few weeks.

Dear Sid,
You are both on target and off target.

5) There was speculation about what would happen when former S@H users landed on other projects. And yup, projects that a lot of "us" have landed on have reported stretched servers. Welcome to the server world of S@H. :)

When I wrote my msg I'd been considering what'd happened over a few weeks.
To illustrate, the project was reporting 750k tasks completing each day with 1.55m tasks in progress, yet a stack of new users were saying they couldn't get any tasks at all.
Yesterday, the project was reporting 920k tasks completing in 24hrs with 1.2m in progress, and no new users saying they had no tasks to run for days.
Turnaround time has come down from an average of 2.06 days to 1.3 days. And there have been no instances of the project reporting it was down due to the volume of server hits.
And both figures were taken at a moment of high task availability.

So, even within the constrained resources here, forcing some changes to the upper and lower limits has absorbed a 5-fold increase in hosts that the project at first struggled, and occasionally failed, to handle.

6) You are right. There has been a learning curve on the part of the experienced BOINC users.

Largely, people have taken the hint, because they are no mugs. Far from it. The technical quality of some recent conversations is a step change from before. We're better off for them on several levels.
The number of complaints from people with self-defeating settings has dropped from the 20s to single figures, and largely it's only those whose settings hinder them who are complaining at all. They'll either learn or leave frustrated. Let's hope it's the former.

8) How would YOU feel if you couldn't get your box to crunch tasks because they were erroring out over programming issues?

I'd listen more than preach, as I have for many years. There have always been programming issues. I glance at tasks that error out and try to report them while they're relevant. I'm sure none of them are deliberate.

Sid, hope you feel better now that you have unloaded.

Reviewing what I wrote, precisely none of the things I may've been off base about are relevant to the project. You explain they're credit-chasing reasons. Tough.
Where I was on target, progress has been spectacular tbh, so I'm very pleased.
And hopefully everyone understands the goals here much better, which is what we're here for. Good news.
Tom M

Joined: 20 Jun 17
Posts: 85
Credit: 5,411,032
RAC: 57,435
Message 94441 - Posted: 14 Apr 2020, 13:08:13 UTC - in response to Message 94357.  

This topic has reminded me of things I wanted to say over the last few weeks.

Dear Sid,
You are both on target and off target.


Sid,
Thank you for a courteous and thoughtful response.
+1

Tom M
Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel.....
Tom M

Joined: 20 Jun 17
Posts: 85
Credit: 5,411,032
RAC: 57,435
Message 94443 - Posted: 14 Apr 2020, 13:48:43 UTC

About 1/3 of the top 100 crunchers here are AMD systems.
Many are Threadripper or EPYC server systems.

I even found an AMD 3950x (16c/32t) in the top 100 listing which does give me hope for mine :)

Tom M
Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel.....
Sid Celery

Joined: 11 Feb 08
Posts: 2000
Credit: 38,577,290
RAC: 17,083
Message 94456 - Posted: 14 Apr 2020, 15:57:51 UTC - in response to Message 94441.  

This topic has reminded me of things I wanted to say over the last few weeks.

Dear Sid,
You are both on target and off target.

Thank you for a courteous and thoughtful response.
+1

Even I didn't think I was <that> courteous, so I'll say the same to you - and mean it! lol