Problems and Technical Issues with Rosetta@home

Sid Celery
Joined: 11 Feb 08
Posts: 1829
Credit: 33,774,937
RAC: 7,531
Message 106262 - Posted: 24 May 2022, 22:33:14 UTC

I've got one weird Python task that's been running for 26 hrs now, but it has been using the CPU for 25.5 hrs of that and has checkpointed regularly, most recently 8 minutes ago.
I've got no idea why it won't end itself.
Does the watchdog no longer work?
CPU time 1d 02:32:21
CPU time since checkpoint 00:08:12
Elapsed time 1d 01:32:50
Estimated time remaining 01:05:35
Fraction done 95.897%
Virtual memory size 98.97 MB
Working set size 2.79 GB

I'm going to abort it now and see what it reports
It should show here aagb-HPR_pp-NMPHE-GPN_pp-BPRO_pp_6_2605012_6_1
ID: 106262 · Rating: 0
kotenok2000
Joined: 22 Feb 11
Posts: 109
Credit: 94,002
RAC: 21
Message 106263 - Posted: 24 May 2022, 22:36:22 UTC - in response to Message 106262.  

Does the .out file in c:\programdata\boinc\slots\[slot number here]\shared change?
ID: 106263 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 1829
Credit: 33,774,937
RAC: 7,531
Message 106264 - Posted: 24 May 2022, 22:59:39 UTC - in response to Message 106262.  

CPU time 1d 02:32:21
CPU time since checkpoint 00:08:12
Elapsed time 1d 01:32:50
Estimated time remaining 01:05:35
Fraction done 95.897%
Virtual memory size 98.97 MB
Working set size 2.79 GB

I'm going to abort it now and see what it reports
It should show here aagb-HPR_pp-NMPHE-GPN_pp-BPRO_pp_6_2605012_6_1

Apologies, it's this task, not the one shown above
aagb-PHE_pp-mPIP-GGLY-mB3LEU_3_2686388_6_0
Run time 1 days 2 hours 37 min 11 sec
CPU time 1 days 2 hours 37 min 11 sec
Validate state Invalid
Application version rosetta python projects v1.03 (vbox64)
windows_x86_64
Peak working set size 96.44 MB
Peak swap size 195.97 MB
Peak disk usage 7,948.44 MB

Can anyone spot the error in the task?
ID: 106264 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 1829
Credit: 33,774,937
RAC: 7,531
Message 106265 - Posted: 24 May 2022, 23:05:55 UTC - in response to Message 106263.  

Does the .out file in c:\programdata\boinc\slots\[slot number here]\shared change?

Sorry, I didn't see this, but neither do I know what .out file I should look at, nor what slot it was running in, nor know if or how it might've changed.
Task aborted now - I assume the info has gone now?
ID: 106265 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1176
Credit: 13,195,130
RAC: 5,044
Message 106266 - Posted: 24 May 2022, 23:08:41 UTC - in response to Message 106264.  

CPU time 1d 02:32:21
CPU time since checkpoint 00:08:12
Elapsed time 1d 01:32:50
Estimated time remaining 01:05:35
Fraction done 95.897%
Virtual memory size 98.97 MB
Working set size 2.79 GB

I'm going to abort it now and see what it reports
It should show here aagb-HPR_pp-NMPHE-GPN_pp-BPRO_pp_6_2605012_6_1

Apologies, it's this task, not the one shown above
aagb-PHE_pp-mPIP-GGLY-mB3LEU_3_2686388_6_0
Run time 1 days 2 hours 37 min 11 sec
CPU time 1 days 2 hours 37 min 11 sec
Validate state Invalid
Application version rosetta python projects v1.03 (vbox64)
windows_x86_64
Peak working set size 96.44 MB
Peak swap size 195.97 MB
Peak disk usage 7,948.44 MB

Can anyone spot the error in the task?

No error I can spot before this line, then several:

Hypervisor System Log:

However, these can be due to the abort.

It may be a task that ran much longer than expected, without anything going wrong. If so, just letting it run enough longer would have let it finish.
ID: 106266 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1176
Credit: 13,195,130
RAC: 5,044
Message 106267 - Posted: 24 May 2022, 23:19:07 UTC - in response to Message 106265.  

Does the .out file in c:\programdata\boinc\slots\[slot number here]\shared change?

Sorry, I didn't see this, but neither do I know what .out file I should look at, nor what slot it was running in, nor know if or how it might've changed.
Task aborted now - I assume the info has gone now?

To find the slot number, click on the task in the tasks column, then on Properties.

The info is gone shortly after the output files are uploaded and the task is reported as finished.

The probable change to look for is any change to the dates and size of the .out file.

If there is more than one .out file in the slot directory, look for changes in the dates or sizes of all of them.
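
A minimal Python sketch of that check, in case doing it by eye is a nuisance. It records the size and modification time of every .out file in a directory so that two looks taken a few minutes apart can be compared. The function names are made up for illustration, and the directory you pass in is whatever slot folder your own task is using.

```python
from pathlib import Path


def snapshot_out_files(directory):
    """Return {file name: (size in bytes, modification time)} for each .out file."""
    snapshot = {}
    for path in Path(directory).glob("*.out"):
        info = path.stat()
        snapshot[path.name] = (info.st_size, info.st_mtime)
    return snapshot


def changed_files(before, after):
    """Names of .out files that are new, or whose size or modification time differ."""
    return sorted(
        name for name, stats in after.items() if before.get(name) != stats
    )
```

Take one snapshot of the slot's shared folder, wait several minutes, take a second one, and pass both to `changed_files`. An empty list suggests the task has not written anything to its .out files in between.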
ID: 106267 · Rating: 0
kotenok2000
Joined: 22 Feb 11
Posts: 109
Credit: 94,002
RAC: 21
Message 106268 - Posted: 24 May 2022, 23:21:45 UTC
Last modified: 24 May 2022, 23:22:30 UTC

You can copy the .out file twice, waiting several minutes between copies, and then compare the two copies with WinMerge.
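
If WinMerge isn't installed, roughly the same comparison can be done with Python's standard `difflib` module. This is only a sketch: `diff_copies` is a made-up helper name, and the two paths are wherever you saved your own copies.

```python
import difflib
from pathlib import Path


def diff_copies(older, newer):
    """Return unified-diff lines between two saved copies of an .out file.

    An empty list means the two copies are identical.
    """
    # errors="replace" keeps odd bytes in the log from stopping the comparison.
    old_lines = Path(older).read_text(errors="replace").splitlines()
    new_lines = Path(newer).read_text(errors="replace").splitlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=str(older), tofile=str(newer), lineterm=""
        )
    )
```

Lines starting with `+` in the result were added between the first copy and the second, which is what you would expect to see if the task is still making progress.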
ID: 106268 · Rating: 0
.clair.
Joined: 2 Jan 07
Posts: 179
Credit: 23,122,300
RAC: 897
Message 106269 - Posted: 25 May 2022, 0:19:29 UTC

Looks like Rosetta 4.2 just got a batch of miniprotein work in, grab them while they're hot.
The front page job queue went up by millions.
ID: 106269 · Rating: 0
Jean-David Beyer
Joined: 2 Nov 05
Posts: 63
Credit: 3,934,788
RAC: 551
Message 106270 - Posted: 25 May 2022, 1:54:56 UTC - in response to Message 106269.  

Looks like Rosetta 4.2 just got a batch of miniprotein work in, grab them while they're hot.
The front page job queue went up by millions.


I just got 25 4.2 work units and five are currently running.
Mine are regular work units, not Rosetta mini work units. But the tasks look like this:

Tue 24 May 2022 09:35:07 PM EDT | Rosetta@home | Starting task miniprotein_relax_v2_1_SAVE_ALL_OUT_IGNORE_THE_REST_5yb7eb8g_2914917_13_0
ID: 106270 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 12 Aug 06
Posts: 1181
Credit: 5,644,809
RAC: 2,152
Message 106272 - Posted: 25 May 2022, 9:18:03 UTC - in response to Message 106266.  

It may be a task that ran much longer than expected, without anything going wrong. If so, just letting it run enough longer would have let it finish.
I always leave them running unless the CPU is not actually being used. In that one, "CPU time 1d 02:32:21" I assume refers to real calculations, and "Elapsed time 1d 01:32:50" refers to actual time taken. I'm not familiar with wherever that came from, I use Boinctasks. So I think that one was calculating on a whole CPU core.
ID: 106272 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 12 Aug 06
Posts: 1181
Credit: 5,644,809
RAC: 2,152
Message 106273 - Posted: 25 May 2022, 9:20:08 UTC - in response to Message 106270.  

I just got 25 4.2 work units and five are currently running.
Mine are regular work units, not Rosetta mini work units. But the tasks look like this:

Tue 24 May 2022 09:35:07 PM EDT | Rosetta@home | Starting task miniprotein_relax_v2_1_SAVE_ALL_OUT_IGNORE_THE_REST_5yb7eb8g_2914917_13_0
Same here, I have about 80, some are rb (I think I got those by chance just before the onslaught) some are miniprotein, all labelled Rosetta 4.2 as the application though. So a small protein but not a small work unit?
ID: 106273 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1176
Credit: 13,195,130
RAC: 5,044
Message 106274 - Posted: 25 May 2022, 13:11:16 UTC - in response to Message 106272.  
Last modified: 25 May 2022, 13:13:17 UTC

It may be a task that ran much longer than expected, without anything going wrong. If so, just letting it run enough longer would have let it finish.
I always leave them running unless the CPU is not actually being used. In that one, "CPU time 1d 02:32:21" I assume refers to real calculations, and "Elapsed time 1d 01:32:50" refers to actual time taken. I'm not familiar with wherever that came from, I use Boinctasks. So I think that one was calculating on a whole CPU core.

CPU time is probably the time used according to the small operating system inside the vbox64 emulation, which is usually close but not identical to the elapsed time, or actual time used.

That task would be calculating on a whole or physical core if nothing else was trying to use the other virtual core for that physical core.

Multiple small proteins at once could give a long workunit.
ID: 106274 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 12 Aug 06
Posts: 1181
Credit: 5,644,809
RAC: 2,152
Message 106275 - Posted: 25 May 2022, 15:27:28 UTC

Every single one I've had fail has had bugger all CPU time compared to wall time. I usually notice 27 seconds of work has been done in 5 hours and cancel it. Everything else has run to completion. I wonder if there's an automated way to detect suss CPU time ratios?
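
One rough way to do it, as a sketch rather than anything official: parse the CPU time and elapsed time in the form shown in the task properties (for example "1d 02:32:21") and flag a task when the ratio is tiny. The function names, the 10% ratio and the one-hour minimum are all assumptions picked for illustration.

```python
import re

# Matches durations such as "1d 02:32:21" or "00:08:12".
_DURATION = re.compile(r"^(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})$")


def parse_duration(text):
    """Convert a "[Nd ]HH:MM:SS" string to a number of seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def looks_stalled(cpu_time, elapsed_time, min_ratio=0.1, min_elapsed=3600):
    """True if the task has run a while but used very little CPU.

    min_ratio and min_elapsed are arbitrary thresholds, not BOINC settings.
    """
    elapsed = parse_duration(elapsed_time)
    if elapsed < min_elapsed:
        return False  # too early to judge
    return parse_duration(cpu_time) / elapsed < min_ratio
```

With the numbers from this thread, `looks_stalled("1d 02:32:21", "1d 01:32:50")` is False because the CPU time is about as large as the elapsed time, while `looks_stalled("00:00:27", "05:00:00")` is True.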
ID: 106275 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 12 Aug 06
Posts: 1181
Credit: 5,644,809
RAC: 2,152
Message 106278 - Posted: 26 May 2022, 15:04:04 UTC
Last modified: 26 May 2022, 15:04:25 UTC

Ah, this is the problem. The Python book has only one use:

ID: 106278 · Rating: 0
.clair.
Joined: 2 Jan 07
Posts: 179
Credit: 23,122,300
RAC: 897
Message 106281 - Posted: 26 May 2022, 18:28:24 UTC

My quick analysis of the desktop items in the photo: I see that with Python tasks they really are comparing oranges with almonds...
ID: 106281 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 12 Aug 06
Posts: 1181
Credit: 5,644,809
RAC: 2,152
Message 106284 - Posted: 26 May 2022, 19:29:08 UTC - in response to Message 106281.  

What surprises me is they couldn't afford a monitor with adjustable height.
ID: 106284 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 12 Aug 06
Posts: 1181
Credit: 5,644,809
RAC: 2,152
Message 106285 - Posted: 26 May 2022, 20:19:05 UTC

I also like the way the width of the monitors is of no use. Bring back 4:3! 16:9 is for TVs!
ID: 106285 · Rating: 0
.clair.
Joined: 2 Jan 07
Posts: 179
Credit: 23,122,300
RAC: 897
Message 106287 - Posted: 26 May 2022, 23:55:51 UTC
Last modified: 27 May 2022, 0:05:14 UTC

With some of the long work unit names Rosetta has, it gives a better chance to fit them on the screen: Save_aall_the_squishy_bIt5-and -puT_the_rest0uT_for_the_traj5.rAbid_raBit names

All this digital technology creating a paperless society. Not.
More and more bits of dead tree pulverized and squashed flat and scribbled on to remind us WTF all that stuff on screen is about.
ID: 106287 · Rating: 0
Peter Hucker of the Scottish Boinc Team
Joined: 12 Aug 06
Posts: 1181
Credit: 5,644,809
RAC: 2,152
Message 106288 - Posted: 27 May 2022, 0:43:31 UTC - in response to Message 106287.  
Last modified: 27 May 2022, 0:50:37 UTC

I once had a colleague with 30 Post-it notes all around her monitor with all the passwords she used.

Paper is a renewable resource (and isn't it trapping that "evil" carbon?). At my work they said to stop using so much paper. People were allegedly printing at 14p a page in colour. The management produced colour photocopiers that could do it for 6p a page. I pointed out we were actually using Brother printers with fake ink at 1p a page. The paper cost more than the ink even for a full colour page. Then I found out the "survey" on cost was done by the company (Xerox) renting us the photocopiers, using the cost of HP printers with genuine rip-off ink. Then the arguments started.
ID: 106288 · Rating: 0
Felicia
Joined: 8 May 22
Posts: 7
Credit: 91,865
RAC: 1,519
Message 106292 - Posted: 27 May 2022, 9:27:07 UTC - in response to Message 106285.  

I also like the way the width of the monitors is of no use. Bring back 4:3! 16:9 is for TVs!

I love 16:9 on my 25 inch, it's better than 16:10 for running 2 programs side by side (or three when troubleshooting logs, webclient and server side).

That said, I've got a weird scheduling issue with my client. I have jobs that need to report before x but those jobs are not always the ones that get initiated when another job finishes. This leads to jobs reporting past their due date and I'm not sure whether that invalidates them.

Screenshot (sorted by report before date): https://imgur.com/a/wbHnfzf
There are 2 jobs that need to report before 28-5 6 am, and 2 that need to report before 11:30 am, but there are 4 jobs running that need to report before 28-5 3:30pm and later.
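
To get a feel for whether the queue can still make its deadlines, here is a small sketch. It is not how the BOINC client actually schedules work. It simply assumes every waiting task is started in earliest-deadline-first order on a fixed number of cores, with no task paused once started, and lists the ones that would finish late. The function name and the task tuples are made up for illustration.

```python
import heapq


def late_tasks_edf(tasks, cores, now):
    """Return names of tasks that would miss their deadline.

    tasks: list of (name, remaining run time as timedelta, deadline as datetime).
    Tasks are started earliest deadline first, each on the next core to free up.
    """
    free_at = [now] * cores  # time at which each core becomes available
    heapq.heapify(free_at)
    late = []
    for name, remaining, deadline in sorted(tasks, key=lambda task: task[2]):
        start = heapq.heappop(free_at)  # earliest available core
        finish = start + remaining
        heapq.heappush(free_at, finish)
        if finish > deadline:
            late.append(name)
    return late
```

If this returns an empty list for the queue as it stands, the deadlines are at least achievable in principle on that many cores. A non-empty list means some tasks would be late even in the best case under these assumptions.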
ID: 106292 · Rating: 0


©2022 University of Washington
https://www.bakerlab.org