Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Greg_BE
Joined: 30 May 06
Posts: 5690
Credit: 5,859,226
RAC: 190
Message 104370 - Posted: 21 Jan 2022, 19:15:32 UTC - in response to Message 104361.  

I have posted this many times before. They should make it a sticky if there were any moderator around to do it.

If you are running VirtualBox 6.1.x, you will get the "Vm job unmanageable" problem with the pythons. That is true whether you are running Windows or Linux.
The difference is that it can be fixed with Windows. You go back to VirtualBox 5.2.44
https://www.virtualbox.org/wiki/Download_Old_Builds_5_2

Unfortunately, that does not work on Linux, at least Ubuntu. Firstly, Ubuntu 20.04.3 works only with VBox 6.1.x.
Secondly, even going back to Ubuntu 18.04.6, which allows you to install VBox 5.2.44, still has the problem.

They need to fix it at the project end, by compiling a new Vbox wrapper. They did it on LHC, and it works there. (It has to do with the COM interface, in case you are interested.)

NB: If you reboot frequently, you may not see the problem. It usually occurs after the pythons have been running 12 hours or so, but I have seen it even after a reboot on Ubuntu.



Jim, I went back to 6.1 and I do not have problems.
I can run all my projects there.
Going back to 5.2 is a good place to start for troubleshooting Python VM problems, but this can affect other projects.
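
Since the reports above depend on which VirtualBox series is installed, here is a minimal Python sketch for checking it before comparing notes. It assumes `VBoxManage` is on the PATH and that `VBoxManage --version` prints a string of the form `6.1.30r148432`. The helper names are made up for illustration, and the 6.1.x check only encodes the claim quoted above, not a confirmed diagnosis.

```python
import re
import subprocess


def parse_vbox_version(raw):
    """Return (major, minor, patch) from a string such as '6.1.30r148432'."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", raw)
    if match is None:
        raise ValueError(f"unrecognised VirtualBox version string: {raw!r}")
    return tuple(int(part) for part in match.groups())


def is_6_1_series(version):
    """True for any 6.1.x release, the series reported above as problematic."""
    return version[:2] == (6, 1)


def installed_vbox_version():
    """Query VBoxManage; return None if it is missing or the output is unexpected."""
    try:
        result = subprocess.run(
            ["VBoxManage", "--version"],
            capture_output=True, text=True, check=True,
        )
        return parse_vbox_version(result.stdout)
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None


if __name__ == "__main__":
    version = installed_vbox_version()
    if version is None:
        print("VirtualBox not found or version not recognised")
    else:
        print(version, "6.1.x series" if is_6_1_series(version) else "not 6.1.x")
```

Posting the reported version alongside the host OS makes it easier to see whether a given "Vm job unmanageable" report matches the pattern described in the quote.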

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,231,553
RAC: 1,264
Message 104371 - Posted: 21 Jan 2022, 19:28:57 UTC - in response to Message 104369.  

Pythons for 12 hours? They average 2 hours here.

As for only being able to run a couple at a time, you need a lot of RAM. I can run 5 but it won't do 6 in 16GB.

No, not an individual run for 12 hours. After running a series of them continually.
I didn't say anything about running only two. I usually run at least eight, and am presently running twenty on a Ryzen 3900X with 80 GB of memory.



Holy cow! 80 gigs?!?! That's more than my budget can afford!
It hurt enough to put in 32 on top of the 16 I put in about 4 years ago.

I just can't see investing much more for being a volunteer.
At best another 1080 or better, but that's it.
I have a Ryzen with 64GB, but it's my main computer. Less than that is pitiful by today's standards. It will take 128GB.
I have two Boinc only machines with 36GB in them. I upped them just enough to run LHC.

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 104372 - Posted: 21 Jan 2022, 21:03:34 UTC - in response to Message 104371.  

I have a Ryzen with 64GB, but it's my main computer. Less than that is pitiful by today's standards. It will take 128GB.
I have two Boinc only machines with 36GB in them. I upped them just enough to run LHC.

Most projects don't take nearly as much as the pythons or LHC of course.

I like memory, but beyond 64 GB you have stability problems, since you have to use all four slots.
Sometimes it works, but you often have to juggle memory around. You may have to spend more than you anticipated.
Two slots is a lot safer.

Falconet
Joined: 9 Mar 09
Posts: 350
Credit: 1,105,396
RAC: 1
Message 104373 - Posted: 21 Jan 2022, 21:10:55 UTC - in response to Message 104368.  

Robetta, as far as I can tell, is separate from Rosetta@home and is used mostly by researchers outside of the Baker Lab/IPD. It's an interface for users who wish to get computing power for their jobs.
Jobs that require the use of Rosetta 4.20 that are submitted to Robetta get sent to Rosetta@home but the rest goes to the other servers that they set up when they launched RoseTTAFold.



Ok..so then where do they get the million something tasks in queue?
But yet there appears to be only a few thousand released?
There has always been a million something in queue, even back when it was just 4.2 alone.
So something doesn't add up.
And that you can't see what is next in line....but yet you can see Robetta?
Plus someone kept quoting Robetta information some time ago as if that was where RAH gets its work.


The queue we see on the Rosetta@home page represents the jobs that the IPD directly submits to run at Rosetta@home + whatever gets submitted at Robetta to run on Rosetta@home, the rb_11111_11111 jobs.

If I could see the Rosetta@home queue, it would likely be close to 100% Rosetta Python jobs. The Pythons are refilled from the queue up to a max of 5,000 ready to send. I don't know why (server resource constraints?), but it's not like Rosetta@home can do a lot of these at any given time, so there's no point in increasing that value. So yeah, those 2.6 million jobs on the queue are Pythons. Sid Celery posted something a few months ago that he received from Admin or someone like that, who said that the Python job that had been submitted by one of the IPD researchers was "huge".

I think one of the features on Robetta is to make sure that everything is open - that is, any researcher can see what is being worked on both present and past and maybe avoid duplication of work. I recall during the pandemic that they asked Robetta users to make sure their jobs were visible to others so that everyone could benefit.
(I have a suspicion that researchers can hide their jobs - often times, I try to search for the jobs I'm running on my computers using the ID number but Robetta doesn't return any results).

I don't know who said that but my impression has always been that work that comes from Robetta is labelled rb and everything else that isn't labelled rb is directly submitted to Rosetta@home.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,231,553
RAC: 1,264
Message 104375 - Posted: 21 Jan 2022, 22:08:20 UTC - in response to Message 104372.  
Last modified: 21 Jan 2022, 22:08:46 UTC

Most projects don't take nearly as much as the pythons or LHC of course.

I like memory, but beyond 64 GB you have stability problems, since you have to use all four slots.
Sometimes it works, but you often have to juggle memory around. You may have to spend more than you anticipated.
Two slots is a lot safer.
Everything works better with more memory, if you're not using it you get a massive disk cache. Using all four slots does not cause stability problems. Always test your new memory with memtest before use, even quality stuff has duds.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,231,553
RAC: 1,264
Message 104376 - Posted: 21 Jan 2022, 22:11:10 UTC - in response to Message 104373.  

Sid Celery posted something a few months ago that he received from Admin or someone like that who said that the Python job that had been submitted by one of the IPD researchers was "huge".
It's not that big, it's only a few million tasks. I've seen the queue at 15 million. But maybe that was several projects at once.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1611
Credit: 16,521,941
RAC: 3,372
Message 104377 - Posted: 21 Jan 2022, 22:16:09 UTC - in response to Message 104368.  

Ok..so then where do they get the million something tasks in queue?
But yet there appears to be only a few thousand released?
There has always been a million something in queue, even back when it was just 4.2 alone.
So something doesn't add up.

Back when there were Rosetta 4.20 Tasks, all those millions were Rosetta 4.20 Tasks, and if you checked the Total queued jobs number it would gradually run down to zero (or jump up again as new work was released).

Now most of the work is Python, and that's what that number shows. Extremely occasionally it jumps up again when that extremely rare batch of Rosetta 4.20 work is released.
However, most of the time it sits around the 2-2.7 million mark, because Python work is being completed at roughly the same rate as new Python work is released.


The Unsent value in the Application task list is the amount of work that's ready to go for that particular application (I think the ratio is 6:1 Rosetta 4.20:Python). The Tasks ready to send value under the Computing Status is the Rosetta 4.20 & Python Tasks by application values combined.
The Total queued jobs value combines the Rosetta 4.20 work, the Python work, and the Unsent 4.20 & Python work. It is the total of all types of work at all stages yet to be processed.
Grant
Darwin NT

Falconet
Joined: 9 Mar 09
Posts: 350
Credit: 1,105,396
RAC: 1
Message 104378 - Posted: 21 Jan 2022, 22:45:26 UTC - in response to Message 104376.  

Sid Celery posted something a few months ago that he received from Admin or someone like that who said that the Python job that had been submitted by one of the IPD researchers was "huge".
It's not that big, it's only a few million tasks. I've seen the queue at 15 million. But maybe that was several projects at once.



You are correct but these 2 million tasks will take a long time to finish at the current rate because only 15,000 or so are running at any given point.
That's why it's "huge"

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1611
Credit: 16,521,941
RAC: 3,372
Message 104379 - Posted: 21 Jan 2022, 23:00:30 UTC - in response to Message 104378.  

Sid Celery posted something a few months ago that he received from Admin or someone like that who said that the Python job that had been submitted by one of the IPD researchers was "huge".
It's not that big, it's only a few million tasks. I've seen the queue at 15 million. But maybe that was several projects at once.
You are correct but these 2 million tasks will take a long time to finish at the current rate because only 15,000 or so are running at any given point.
That's why it's "huge"
Yep.
Roughly 1 in 133 is being processed. Compared to Rosetta 4.20 at their peak (20 million queued up, 400k in progress) 1 in 50 being processed.

And given the huge issues with Python Tasks, such as those that sit there not actually using any CPU time so they're not actually being processed, i'd suggest that 1 in 133 value in reality is way, way, waaaay worse than that.
Grant
Darwin NT
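
The 1-in-133 and 1-in-50 figures follow directly from the numbers quoted in this exchange. A small Python sketch of that arithmetic, with a helper name that is purely illustrative:

```python
def queued_per_running(queued, running):
    """How many queued tasks exist for each task currently being processed."""
    return queued / running


# Figures quoted in the thread: about 2 million Python tasks queued with
# about 15,000 running, versus Rosetta 4.20 at its peak.
python_ratio = queued_per_running(2_000_000, 15_000)
rosetta_peak_ratio = queued_per_running(20_000_000, 400_000)

print(f"Python: 1 in {python_ratio:.0f}")
print(f"Rosetta 4.20 peak: 1 in {rosetta_peak_ratio:.0f}")
```

The ratio only compares queue size with tasks in progress. It says nothing about tasks that are nominally running but using no CPU time, which is the caveat raised in the post above.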

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,231,553
RAC: 1,264
Message 104380 - Posted: 21 Jan 2022, 23:26:40 UTC - in response to Message 104378.  

You are correct but these 2 million tasks will take a long time to finish at the current rate because only 15,000 or so are running at any given point.
That's why it's "huge"
In the same way as this microwave is huge compared to that sofa, because I'm carrying it on my bicycle instead of the car.

Greg_BE
Joined: 30 May 06
Posts: 5690
Credit: 5,859,226
RAC: 190
Message 104381 - Posted: 22 Jan 2022, 0:02:19 UTC - in response to Message 104371.  

Pythons for 12 hours? They average 2 hours here.

As for only being able to run a couple at a time, you need a lot of RAM. I can run 5 but it won't do 6 in 16GB.

No, not an individual run for 12 hours. After running a series of them continually.
I didn't say anything about running only two. I usually run at least eight, and am presently running twenty on a Ryzen 3900X with 80 GB of memory.



Holy cow! 80 gigs?!?! That's more than my budget can afford!
It hurt enough to put in 32 on top of the 16 I put in about 4 years ago.

I just can't see investing much more for being a volunteer.
At best another 1080 or better, but that's it.
I have a Ryzen with 64GB, but it's my main computer. Less than that is pitiful by today's standards. It will take 128GB.
I have two Boinc only machines with 36GB in them. I upped them just enough to run LHC.


Once I get my new drive installed this weekend, I should be able to undo the restriction I have right now on python and with the current memory, I should be able to run a few more pythons plus all my other projects or a full load of pythons (16) and have a little bit of memory left over.
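
A rough way to sanity-check how many pythons might fit is to scale the one data point given earlier in the thread (5 run in 16 GB, 6 do not) linearly with installed RAM. The sketch below is only that: the function name is hypothetical, 48 GB is an assumption based on the "32 on top of the 16" remark, and linear scaling ignores the fixed share of memory taken by the OS and other projects.

```python
def scaled_task_count(ram_gb, ref_tasks, ref_ram_gb):
    """Scale a reference task count linearly with RAM, using integer math."""
    return ram_gb * ref_tasks // ref_ram_gb


REF_RAM_GB = 16      # machine from the quoted post
REF_FITS = 5         # reported to run
REF_TOO_MANY = 6     # reported not to run

for ram_gb in (48, 80):
    fits = scaled_task_count(ram_gb, REF_FITS, REF_RAM_GB)
    too_many = scaled_task_count(ram_gb, REF_TOO_MANY, REF_RAM_GB)
    print(f"{ram_gb} GB: about {fits} should fit, {too_many} probably would not")
```

On these assumptions, 16 pythons in 48 GB sits right at the edge, while the 20 reported on the 80 GB machine is comfortably inside the estimate.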

Greg_BE
Joined: 30 May 06
Posts: 5690
Credit: 5,859,226
RAC: 190
Message 104382 - Posted: 22 Jan 2022, 0:06:43 UTC - in response to Message 104379.  

Sid Celery posted something a few months ago that he received from Admin or someone like that who said that the Python job that had been submitted by one of the IPD researchers was "huge".
It's not that big, it's only a few million tasks. I've seen the queue at 15 million. But maybe that was several projects at once.
You are correct but these 2 million tasks will take a long time to finish at the current rate because only 15,000 or so are running at any given point.
That's why it's "huge"
Yep.
Roughly 1 in 133 is being processed. Compared to Rosetta 4.20 at their peak (20 million queued up, 400k in progress) 1 in 50 being processed.

And given the huge issues with Python Tasks, such as those that sit there not actually using any CPU time so they're not actually being processed, i'd suggest that 1 in 133 value in reality is way, way, waaaay worse than that.



Well you've seen the numbers. People come and try it out and leave.
Others can't get it to work and leave.
Without the staff taking notice or caring, it will be a downward to stable trend of systems instead of upward.
But again, they don't care about numbers, just as long as the work gets done eventually.

Greg_BE
Joined: 30 May 06
Posts: 5690
Credit: 5,859,226
RAC: 190
Message 104383 - Posted: 22 Jan 2022, 0:10:59 UTC - in response to Message 104375.  

Most projects don't take nearly as much as the pythons or LHC of course.

I like memory, but beyond 64 GB you have stability problems, since you have to use all four slots.
Sometimes it works, but you often have to juggle memory around. You may have to spend more than you anticipated.
Two slots is a lot safer.
Everything works better with more memory, if you're not using it you get a massive disk cache. Using all four slots does not cause stability problems. Always test your new memory with memtest before use, even quality stuff has duds.


I have 49 and change spread out over 4 slots.
Everything works as it should.
The new drive is 500 gigs and it will be dedicated to BOINC
So there is more than enough room for swap or whatever else BOINC wants to do.

Greg_BE
Joined: 30 May 06
Posts: 5690
Credit: 5,859,226
RAC: 190
Message 104384 - Posted: 22 Jan 2022, 0:12:54 UTC

Total queued jobs: 2,589,661
In progress: 53,882
Successes last 24h: 34,678

that's what the page says.
Pretty small numbers against the 2 mill.
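
To put those three numbers in perspective, here is a back-of-the-envelope Python sketch. It assumes no new work is added, that the last 24 hours are representative, and that every success counts against the same queue, none of which the status page guarantees.

```python
# Values copied from the status page quoted above.
total_queued = 2_589_661
in_progress = 53_882
successes_last_24h = 34_678

# Days needed to clear the queue at the current success rate.
days_to_drain = total_queued / successes_last_24h

# Queued jobs per job currently in progress.
queued_per_in_progress = total_queued / in_progress

print(f"About {days_to_drain:.1f} days to drain the queue")
print(f"Roughly 1 in {queued_per_in_progress:.0f} queued jobs is in progress")
```

On those assumptions the queue would take a bit under 75 days to clear.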

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 104385 - Posted: 22 Jan 2022, 3:22:59 UTC - in response to Message 104375.  

Everything works better with more memory, if you're not using it you get a massive disk cache. Using all four slots does not cause stability problems. Always test your new memory with memtest before use, even quality stuff has duds.

You get a write cache only if you install one. PrimoCache is the only one that I know of for Windows, which I use to protect my SSDs.

Memtest really doesn't have much to do with stability. It is mainly for errors, which might cause crashes, but more likely failures in work units.
With large amounts of memory, especially the two-sided memory modules, you will see many more crashes using four slots. Check the forums.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1611
Credit: 16,521,941
RAC: 3,372
Message 104386 - Posted: 22 Jan 2022, 3:57:28 UTC - in response to Message 104385.  
Last modified: 22 Jan 2022, 4:06:35 UTC

You get a write cache only if you install one. PrimoCache is the only one that I know of for Windows, which I use to protect my SSDs.
?
The default setting for Windows is write caching enabled.
If you want to set its size (other than doing registry hacks), then you'd need a 3rd party one.



With large amounts of memory, especially the two-sided memory modules, you will see many more crashes using four slots. Check the forums.
I've had systems with only 2 slots used & memory problems. I've had systems with all slots used & no problems.
While the more components, the greater the likelihood of failure, the biggest cause of issues with more than 2 modules is people pushing the RAM too hard.
Yes, 2 modules allows you tighter timings and higher clocks. But as long as you use modules of the same brand & model, and don't push them beyond their rated clocks & timings, you won't have any issues.
Look at server systems that may have 32 (or more) DIMM slots.
Grant
Darwin NT

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 104387 - Posted: 22 Jan 2022, 6:37:57 UTC - in response to Message 104386.  
Last modified: 22 Jan 2022, 6:53:17 UTC

The default setting for Windows is write caching enabled.
If you want to set its size (other than doing registry hacks), then you'd need a 3rd party one.
That is just the cache on the disk drive itself, and is relatively small. These days, it is often just a faster section of the flash memory (e.g., two-level instead of four level or more).
Therefore, it is subject to the same wearout mechanism, just a bit more slowly.

But using a portion of main memory as the write cache is much faster, and will protect the SSD from the high level of writes, such as on the pythons.
And it can be very large. I usually use at least 8 GB. I posted on it in another topic.


I've had systems with only 2 slots used & memory problems. I've had systems with all slots used & no problems.
While the more components, the greater the likelihood of failure, the biggest cause of issues with more than 2 modules is people pushing the RAM too hard.
Yes, 2 modules allows you tighter timings and higher clocks. But as long as you use modules of the same brand & model, and don't push them beyond their rated clocks & timings, you won't have any issues.
Look at server systems that may have 32 (or more) DIMM slots.
I have had much more experience. And the larger the CPU, the worse the problems. With two Ryzen 3900X and two Ryzen 3950X, I have seen them all. It saved me some grief with the 5900 series.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1611
Credit: 16,521,941
RAC: 3,372
Message 104388 - Posted: 22 Jan 2022, 8:07:15 UTC - in response to Message 104387.  

The default setting for Windows is write caching enabled.
If you want to set its size (other than doing registry hacks), then you'd need a 3rd party one.
That is just the cache on the disk drive itself, and is relatively small. These days, it is often just a faster section of the flash memory (e.g., two-level instead of four level or more).
Therefore, it is subject to the same wearout mechanism, just a bit more slowly.

But using a portion of main memory as the write cache is much faster, and will protect the SSD from the high level of writes, such as on the pythons.
And it can be very large. I usually use at least 8 GB. I posted on it in another topic.
Every article I've seen about the Win10 write caching says it is using system RAM to cache writes; it has nothing to do with the drive's own onboard buffering.
Grant
Darwin NT

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 104393 - Posted: 22 Jan 2022, 13:06:00 UTC - in response to Message 104388.  
Last modified: 22 Jan 2022, 13:11:14 UTC

Every article I've seen about the Win10 write caching says it is using system RAM to cache writes; it has nothing to do with the drive's own onboard buffering.

I think you are confusing that with read caches, but I will look. If it were caching writes, you would probably know it.
If the cached writes were saved to disk, it would take a long time to shut down, for example. And the programs that show the writes to disk would indicate it. I don't see it.

Read caches are easier to implement, but less necessary. They don't save the SSD from excessive writes. And the reads from SSDs are fast anyway, so the read caches are not all that necessary.

EDIT: The only thing I see is this.
https://www.windowscentral.com/how-manage-disk-write-caching-external-storage-windows-10
That is just disk write caching, as I previously discussed. It uses only a small amount of memory, not the GB that you need to protect the SSDs from the pythons.
The write rates on the pythons are horrendous. I am getting well over 1 TB/day (almost 2 TB) when running 20 pythons, even with a huge 26 GB write cache. That is too much. I will do something else with this machine.
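
To see why that write rate is a concern, a short Python sketch can turn it into a per-task figure and an endurance estimate. The 2 TB/day and 20 tasks come from the post above; the 600 TBW endurance rating is a purely hypothetical value chosen for illustration, so substitute the rating from your own drive's datasheet.

```python
def per_task_write_rate(total_tb_per_day, task_count):
    """Average TB written per day by each concurrently running task."""
    return total_tb_per_day / task_count


def days_to_rated_endurance(rated_tbw, tb_per_day):
    """Days until cumulative writes reach the drive's rated TBW figure."""
    return rated_tbw / tb_per_day


OBSERVED_TB_PER_DAY = 2.0     # "almost 2 TB" per day, from the post
RUNNING_TASKS = 20            # pythons running at the time
HYPOTHETICAL_RATED_TBW = 600  # assumption, not a figure from the thread

per_task_gb = per_task_write_rate(OBSERVED_TB_PER_DAY, RUNNING_TASKS) * 1000
days = days_to_rated_endurance(HYPOTHETICAL_RATED_TBW, OBSERVED_TB_PER_DAY)

print(f"About {per_task_gb:.0f} GB/day per python task")
print(f"Rated endurance reached in about {days:.0f} days")
```

With these inputs each task averages about 100 GB of writes per day, and a drive with that hypothetical rating would reach it in under a year.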

Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 104394 - Posted: 22 Jan 2022, 14:31:13 UTC - in response to Message 104393.  
Last modified: 22 Jan 2022, 14:34:01 UTC

By the way, I used to just put projects with high write rates on a ramdisk, and have all the writes go to main memory.
That really solves the problem. But on the Ryzen 3900X with all the pythons, the BOINC data folder is 107 GB; too much.

I might be able to pull it off on a Ryzen 3600 though; 12 virtual cores might work.

But I think they really need to develop the pythons a bit and call back when they are ready.



©2024 University of Washington
https://www.bakerlab.org