Message boards : Number crunching : If You Don't Know Where to Put it, Post it here.
Author | Message |
---|---|
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I don't want to sound overconfident that I have a perfect understanding of all the relevant facts... but yes. To me, the symptoms you report, tasks running longer than the runtime preference and resetting to zero when you restart the machine (which implies they never complete their first model), all sound like different ways of describing the same problem. You noticed the progress resetting when your machine is shut down overnight. Others reported it as low credit for WUs, and when I look at them they all seem to have been ended by the watchdog (just like yours) and don't seem to have completed their first model (just like yours). I don't see any unexplained symptoms that would preclude these being the same root cause. Unfortunately, given all the work they have on the table to assimilate the COVID results coming in, it sounds like it may be some time before they can work on a fix; I can't even guess whether that means weeks or months. So the suggestion is to let the machine explore some other BOINC projects for a while (you can just mark R@h for "no new tasks" and leave the project on the list), and add Ralph to the list of projects (use a low resource share, since work is sparse); this assures they get some i686 Linux hosts in their next round of testing. Rosetta Moderator: Mod.Sense |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 393 Credit: 12,114,842 RAC: 4,200 |
Greetings, I can guarantee that it does not happen on my systems - during the course of today I’ve brought two of my systems down and restarted them, and all of the Rosetta tasks restarted more or less where they left off. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
A quick peek at Bryn Mawr's 3 hosts, which he says are not seeing the problem, and looking at a random completed task from each, and they all are running the rosetta_4.12_x86_64-pc-linux-gnu application, not the i686. Rosetta Moderator: Mod.Sense |
Siran d'Vel'nahr Send message Joined: 15 Nov 06 Posts: 72 Credit: 2,674,678 RAC: 0 |
Hi Bryn, I can guarantee that it does not happen on my systems - during the course of today I’ve brought two of my systems down and restarted them, and all of the Rosetta tasks restarted more or less where they left off. Good, then it looks like you are running a different app on your tasks than I am. The i686 app, which my tasks are running on, seems to have an issue. Thanks and have a great day! :) Siran CAPT Siran d'Vel'nahr XO USS Vre'kasht NCC-33187 "Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath |
Siran d'Vel'nahr Send message Joined: 15 Nov 06 Posts: 72 Credit: 2,674,678 RAC: 0 |
A quick peek at Bryn Mawr's 3 hosts, which he says are not seeing the problem, and looking at a random completed task from each, and they all are running the rosetta_4.12_x86_64-pc-linux-gnu application, not the i686. Hi Mod, LOL! You posted just before I posted my reply to Bryn's post. By the way, I already have Rosetta set to NNT. :) Thanks and have a great day! :) Siran CAPT Siran d'Vel'nahr XO USS Vre'kasht NCC-33187 "Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 393 Credit: 12,114,842 RAC: 4,200 |
A quick peek at Bryn Mawr's 3 hosts, which he says are not seeing the problem, and looking at a random completed task from each, and they all are running the rosetta_4.12_x86_64-pc-linux-gnu application, not the i686. I’ve had quite a few of the i686 tasks; I guess there were none when I bounced the machines :-) |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
This topic has reminded me of things I wanted to say over the last few weeks.

Having COVID-19 tasks here has brought lots of completely new users to BOINC, but almost coincidentally also a lot of highly experienced BOINC users from Seti, bringing with them some mega crunching power. Together, the number of active users has gone through the roof, the number of tasks burned through has followed, and they've stretched the demands of the project: from 40k active hosts to 240k, with credits awarded going up six-fold.

But it's also brought up some issues. Seti is not Rosetta. Not by a long chalk. A lot of the assumptions learned at Seti don't apply at Rosetta; some are the opposite. And that's brought some conflict with it.

So when I read users saying Seti allowed longer deadlines, so Rosetta should extend its deadlines to suit users, not the project: that's not right. When I read users saying 'I've got a megacrunching behemoth and I should be completing tasks quicker, but I don't, so the tasks aren't running properly and Rosetta needs to sort it out': that's not right. (Your behemoth completes more models within the target runtime than slower machines, so you get higher credit that way.) When users say 'I'm used to hoarding weeks of tasks to meet longer deadlines, but now I don't have enough' (to the exclusion of others when tasks are limited): that's not right. Rosetta actually does something with its results; it looks at those returned and can tweak the next batch of tasks to account for the previous iteration, so a quick turnaround of completed tasks is highly important. And when I see a comment like 'I know Rosetta because I ran it in 2006, one year after the project launched, so in 2020 it should run the same': that's not right. 
And if the graphics don't run on a certain platform, the credits are a bit skewed with a new program version, a certain batch crashes out or the validation fails, and the servers conk out: that's clearly wrong, and it'll get sorted sooner or later, but it's not the priority. And if it's too much for users to bear and you feel like you have to go off in a huff, almost as if satisfying you as a user were a higher priority than the current purpose of the project, instead of having a little patience, reporting the shortcomings so things can be revised properly next time round, and committing just as much to the next round of tasks as you did the last, then too bad.

You're not running Seti any more, with all the assumptions you learned to run there successfully. At Rosetta, some of those assumptions are different, so keep an eye out, cooperate with the needs of this project, not some whole other project with an entirely different goal, purpose, time-scale and set of priorities, and bring that wealth of expertise and experience along as well, and things will be better for everyone.

Yeah, condescending, I know. Save it. |
Siran d'Vel'nahr Send message Joined: 15 Nov 06 Posts: 72 Credit: 2,674,678 RAC: 0 |
Yeah, condescending, I know. Save it. Condescending is an understatement! Much of what you said in this thread has been exaggerated by you or is outright false. If you had read the thread, an issue has been discovered with the v4.12 i686 app and/or the corresponding tasks, and it could take weeks or months for a fix to show up. CAPT Siran d'Vel'nahr XO USS Vre'kasht NCC-33187 "Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath |
Tom M Send message Joined: 20 Jun 17 Posts: 87 Credit: 15,306,243 RAC: 40,050 |
This topic has reminded me of things I wanted to say over the last few weeks. Dear Sid, You are both on target and off target.
1) Many of the "mega-crunchers" that have joined here have up to 14-GPU systems. Some of them don't crunch CPU tasks at all.
1a) The original "spoofing" techniques were only useful for GPU tasks. The 128-thread, highly productive systems at S@H were in a minority, and only a few were even in the top 100 of the leader board.
2) I am not sure some of the bunkering tools from S@H even apply here. They were used to switch CPU tasks to the GPU, etc.
3) Other bunkering tools may not even work here. One of the purposes of "bunkering" was for contests, gaining higher RAC or totals at the end of the S@H project, or allowing us to process while the S@H server was down. S@H went down every Tuesday for 3-12 hours.
4) The types of tasks are changed depending on the results. It is very encouraging to have that level of sensitivity to crunching results. We didn't have that at S@H.
5) There was speculation about what would happen when former S@H users landed on other projects. And yup, projects that a lot of "us" have landed on have reported stretched servers. Welcome to the server world of S@H. :)
6) You are right. There has been a learning curve on the part of the experienced BOINC users.
7) Most of the Seti orphans are running machines with 16 CPU threads or fewer. So for the most part you are getting the results of a LOT of 4c/6c/8c machines joining. These participants were NOT mega crunchers.
8) How would YOU feel if you couldn't get your box to crunch tasks? If they were erroring out over programming issues?
Sid, hope you feel better now that you have unloaded. Respectfully Tom M (one of those "mega" crunchers) (yes, I am crunching 18 threads @ 8 hours a task on a 16c/32t box and have NOT run dry [yet]) Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel..... |
Tom M Send message Joined: 20 Jun 17 Posts: 87 Credit: 15,306,243 RAC: 40,050 |
I just took a look at the current list of "mega crunchers'" systems on Rosetta@home. In the top 20 I don't recognize a single "type" of system that was/is common at Seti@home. Instead I see a huge number of very-high-thread-count systems (upwards of 192 cores/threads). Another thing to note is that the top producers are not unanimous in running the default 8-hour setting; a number of them are running tasks of 3 hours or less. If 8-hour tasks really do stress the server(s) less, and I have every reason to believe they do, perhaps we should start "nagging" (in private, or set up a public shame list?) every top producer (say the top 100) not running at least the 8-hour default to start running it. Presumably they would get the same credit but access the server(s) less often? And given the volume they may be running, it should lower the load on the server(s) significantly. Respectfully, Tom M Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel..... |
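The server-load point above can be put as back-of-envelope arithmetic. Here is a minimal sketch under an assumed model (one scheduler contact per task fetched; the function is hypothetical, for illustration only, not measured server data):

```python
# Rough model: a host that fetches one task per thread as each task
# finishes contacts the scheduler about threads * 24 / runtime_hours
# times per day. This is an assumed model, not measured R@h data.

def fetches_per_day(threads: int, runtime_hours: float) -> float:
    return threads * 24 / runtime_hours

# A 32-thread host at a 3-hour runtime vs the 8-hour default:
print(fetches_per_day(32, 3))  # 256.0 contacts/day
print(fetches_per_day(32, 8))  # 96.0 contacts/day
```

Under this model, moving a host from 3-hour to 8-hour tasks cuts its scheduler traffic by nearly two thirds, which is the intuition behind the suggestion.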
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Credit is based on the number of completed models reported and validated, for the specific batch of tasks being run. Each batch accumulates an average credit claim per model and this is used to grant credit as completed tasks are reported back. This means that regardless of numbers of cores or threads, or how fast your bus or memory access is, or how large your L3 cache, the number of credits granted per model will be nearly identical for all users reporting results to that batch of work. It also means that faster CPUs (where "faster" means lower time to complete each R@h model), will complete more models per hour of runtime, and be granted more credit per hour of runtime. The models are no different in the last hour of running a WU than the first, regardless of your runtime preference. Rosetta Moderator: Mod.Sense |
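The granting scheme Mod.Sense describes amounts to a simple per-model calculation. A minimal sketch (the function name and numbers are hypothetical, not actual R@h server code):

```python
# Illustrative sketch of per-model credit granting as described above.
# The function name and the numbers are hypothetical, not R@h code.

def granted_credit(models_completed: int, batch_avg_claim_per_model: float) -> float:
    """Credit depends only on validated models reported for the batch,
    not on cores, threads, cache size or memory speed."""
    return models_completed * batch_avg_claim_per_model

# Two hosts reporting to the same batch (assumed average claim of
# 12.5 credits per model): the faster CPU simply completes more
# models in the same runtime, so it earns proportionally more credit.
print(granted_credit(24, 12.5))  # fast host: 300.0
print(granted_credit(6, 12.5))   # slow host: 75.0
```

The per-model rate is identical for both hosts; only the model count, i.e. the useful work completed, differs.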
chrislaf Send message Joined: 28 Mar 20 Posts: 3 Credit: 1,487,482 RAC: 0 |
I just took a look at the current list of "mega crunchers'" systems on Rosetta@home. Can you share the link you are using to see this? |
Applejacks Send message Joined: 20 May 17 Posts: 3 Credit: 11,758,466 RAC: 0 |
Try here or here: https://boinc.bakerlab.org/rosetta/top_users.php https://boinc.bakerlab.org/rosetta/top_hosts.php |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1684 Credit: 17,957,902 RAC: 23,323 |
Credit is based on the number of completed models reported and validated, for the specific batch of tasks being run. Each batch accumulates an average credit claim per model and this is used to grant credit as completed tasks are reported back. And the Credit granted for different types of work should therefore be the same (or at least very close). That's the theory behind CreditNew; unfortunately it doesn't work that way, as the large differences in Credit granted between different work types, and the huge differences in APR for a given application on different systems with the same CPU, show. It is what it is. Grant Darwin NT |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The degree to which batches differ from one another can be huge. The variation between like system types is more likely due to the mix of work processed, especially over the span that RAC encompasses, than to the credit system. You also need to consider more than just "same CPU". How fast is the memory access and how much memory is available? Are there other applications contending for cache? What are the rates of page faults and pageouts? How fast is the storage for the page file? These varied factors are how they came up with credit based on useful, validated work (i.e. models). It is platform agnostic. The credit system doesn't care if your machine has a huge L3 cache; what it cares about is how much work your huge L3 cache lets you complete. It doesn't care if you hyperthread or not; it simply looks at the completed work. Rosetta Moderator: Mod.Sense |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1684 Credit: 17,957,902 RAC: 23,323 |
The degree to which batches differ from one another can be huge. The variation between like system types is more likely due to the mix of work processed, especially over the span that RAC encompasses, than to the credit system. You also need to consider more than just "same CPU". How fast is the memory access and how much memory is available? Are there other applications contending for cache? What are the rates of page faults and pageouts? How fast is the storage for the page file? And those factors alone don't account for a factor of 4 (or more) difference in APRs. 20%, 30%, maybe 50%? Yeah. But 400%? No. These varied factors are how they came up with credit based on useful, validated work (i.e. models). It is platform agnostic. The credit system doesn't care if your machine has a huge L3 cache, what it cares about is how much work your huge L3 cache lets you complete. It doesn't care if you hyperthread or not, it simply looks at the completed work. Yep, and that shows just how broken CreditNew is. One of its stated goals was to address the problems of some of the earlier versions. It shouldn't make any difference what application, hardware, operating system or combination thereof processes the work. The very definition of the Cobblestone meant that for a given amount of work (e.g. 100 GFLOPs) you would get so many Credits, based on the performance of a reference system. So regardless of whether you're processing on the slowest ARM system possible, or the fastest core-speed x86 system around, regardless of the OS or the application, the work done is the same* and so the Credit awarded for valid work should be the same. But it's not. It's been a point of contention with CreditNew ever since it was implemented (and why some projects don't use it). It is what it is. 
* for a task of fixed FLOPs. In the case of Rosetta, the number of FLOPs done isn't fixed but the runtime is, so with those two disparate systems, if they both ran a Task for the same time, the Credit awarded would be proportional to the amount of processing done. If the faster system was 10,000 times faster than the slower system, it would get 10,000 times the Credit; 5 times faster, 5 times the Credit. If the faster system ran the Task for 4 hours, it would get half the Credit it would get if it ran it for 8 hours; if it ran it for 16 hours, it would get twice the Credit it would get for 8 hours, the Credit awarded being in proportion to the work done. And while different tasks require different types of processing (not all FLOPs have the same overheads), the Credit granted to each should be the same (or at least very close) for doing the same number of FLOPs. In theory. Grant Darwin NT |
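The proportionality described above can be sketched numerically. This is a hypothetical helper with an arbitrary credit constant, purely to make the arithmetic concrete, not any project's actual formula:

```python
# Sketch: with a Rosetta-style fixed runtime, credit should scale with
# work done, i.e. credit proportional to speed * runtime. The constant
# and function are hypothetical, chosen only for illustration.

def expected_credit(speed_flops: float, runtime_hours: float,
                    credit_per_flop_hour: float = 1.0) -> float:
    return speed_flops * runtime_hours * credit_per_flop_hour

base = expected_credit(1e9, 8)
assert expected_credit(1e13, 8) == base * 10_000  # 10,000x faster -> 10,000x credit
assert expected_credit(1e9, 4) == base / 2        # half the runtime -> half the credit
assert expected_credit(1e9, 16) == base * 2       # double runtime -> double the credit
```

Each assertion mirrors one of the proportionality claims in the post: credit scales linearly with both processing speed and runtime.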
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
This topic has reminded me of things I wanted to say over the last few weeks. When I wrote my msg I'd been considering what'd happened over a few weeks. To illustrate: the project was reporting 750k tasks completing each day with 1.55m tasks in progress, yet a stack of new users were saying they couldn't get any tasks at all. Yesterday, the project was reporting 920k tasks completing in 24hrs with 1.2m in progress, and no new users saying they had no tasks to run for days. Turnaround time has reduced from an average of 2.06 days to 1.3 days. And also no instances of the project reporting it was down due to the volume of server hits. Both figures were taken at a moment of high task availability. So, even within the constrained resources here, forcing some changes to upper and lower limits has absorbed a 5-fold increase in hosts that the project, at first, struggled and occasionally failed to handle. 6) You are right. There has been a learning curve on the part of the experienced BOINC users. Largely, people have taken the hint, because they are no mugs. Far from it. The technical quality of some conversations recently is a step change from before. We're better off for them on several levels. The number of complaints from people with self-defeating settings has dropped from the twenties to single figures, and largely it's only those whose settings hinder them who are complaining at all. They'll either learn or leave frustrated. Let's hope it's the former. 8) How would YOU feel if you couldn't get your box to crunch tasks? If they were erroring out over programming issues? I'd listen more than preach, as I have for many years. There have always been programming issues. I glance at tasks that error out and try to report them while they're relevant. I'm sure none of them are deliberate. Sid, hope you feel better now that you have unloaded. Reviewing what I wrote, precisely none of the things I may've been off base about are relevant to the project. You explain they're credit-chasing reasons. Tough. 
Where I was on target, progress has been spectacular tbh, so I'm very pleased. And hopefully everyone understands the goals here much better, which is what we're here for. Good news. |
Tom M Send message Joined: 20 Jun 17 Posts: 87 Credit: 15,306,243 RAC: 40,050 |
This topic has reminded me of things I wanted to say over the last few weeks. Sid, Thank you for a courteous and thoughtful response. +1 Tom M Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel..... |
Tom M Send message Joined: 20 Jun 17 Posts: 87 Credit: 15,306,243 RAC: 40,050 |
About 1/3 of the top 100 crunchers here are AMD systems. Many are Threadripper or EPYC server systems. I even found an AMD 3950x (16c/32t) in the top 100 listing which does give me hope for mine :) Tom M Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel..... |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
This topic has reminded me of things I wanted to say over the last few weeks. Even I didn't think I was that courteous, so I'll say the same to you - and mean it! lol |
©2024 University of Washington
https://www.bakerlab.org