Posts by Yifan Song

41) Message boards : Number crunching : Granted Credit taking forever.... (Message 63339)
Posted 14 Sep 2009 by Yifan Song
Post:
The validator server has pretty much cleaned up the job that created the IO problem. Now it is catching up with the rest of the jobs. It's a little hard to estimate how long that is going to take, but hopefully the worst is over.

I agree that we need to somehow balance our effort between science and IT. I'm still relatively new to this team and still feeling my way through the IT part of the project. Hopefully over time, I'll be able to help DEK on this.
42) Message boards : Number crunching : Granted Credit taking forever.... (Message 63310)
Posted 13 Sep 2009 by Yifan Song
Post:
I actually do still use these data. I just need to figure out a way to make them give me back fewer but better structures in the future.


Or build a new validator server that can handle these intense work units. I'm sure there would be a few opinions as to exactly which processors and memory you should choose to build a new one too.


DEK and I thought about this too. But since the validator spends most of its time reading and merging the data, the bottleneck is the disk IO. Adding another server wouldn't help much since the data is stored on the same file system. The only way to improve the processing rate is to add another file system and divide the validator to work on multiple file systems. This is a lot harder to do and has a high potential to screw up the entire R@H server. So we decided not to do that at this moment.
43) Message boards : Number crunching : Granted Credit taking forever.... (Message 63305)
Posted 12 Sep 2009 by Yifan Song
Post:
I'm seeing some odd behavior on the server side. Half of the jobs came back with just a quarter of the results compared to the other half, and the good and bad ones alternate in file name order. Somehow those jobs are not giving the results back and are not given credit either. I wonder if it is a bug in the validator. Let me spend some time today to dig a little deeper.

Here is the number of results I get back for each sub-batch.
188812 1A0O
47621 1ACB
231038 1AHW
45743 1ATN
231584 1AVW
43186 1AVZ
229673 1BQL
48359 1BRC
233888 1BRS
47867 1BVK
229795 1CGI
49298 1CHO
228705 1CSE
46636 1DFJ
229493 1DQJ
47060 1EFU
223912 1EO8
48390 1FBI
219990 1FIN
45040 1FQ1
213427 1FSS
45614 1GLA
195231 1GOT
47590 1IAI
223155 1IGC
47341 1JHL
224621 1MAH
44385 1MDA
236081 1MEL
48276 1MLC
203274 1NCA
43052 1NMB
231571 1PPE
47207 1QFU
228290 1SPB
48206 1STF
226570 1TAB
48645 1TGS
236948 1UDI
49967 1UGH
229513 1WEJ
43105 1WQ1
219482 2BTF
48810 2JEL
234686 2KAI
48160 2PCC
230399 2PTC
48777 2SIC
231341 2SNI
48764 2TEC
224725 2VIR
47338 3HHR
233194 4HTC
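
For anyone who wants to check the alternating pattern in the counts above, here is a minimal sketch. It assumes the list is saved as plain text with one "count name" pair per line, in file name order. The function name `alternation_summary` and the four-row sample are only illustrative and are not part of any Rosetta@home tooling.

```python
def alternation_summary(text):
    """Compare rows 1, 3, 5, ... against rows 2, 4, 6, ... of a 'count name' listing.

    Returns (mean of odd-numbered rows, mean of even-numbered rows, their ratio).
    """
    counts = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines; keep only "<integer> <name>" rows.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        counts.append(int(parts[0]))
    if len(counts) < 2:
        raise ValueError("need at least two rows to compare")

    first = counts[0::2]   # 1st, 3rd, 5th, ... rows
    second = counts[1::2]  # 2nd, 4th, 6th, ... rows
    high = sum(first) / len(first)
    low = sum(second) / len(second)
    return high, low, high / low


if __name__ == "__main__":
    # First four rows of the listing, used here only as a sample.
    sample = """188812 1A0O
47621 1ACB
231038 1AHW
45743 1ATN"""
    high, low, ratio = alternation_summary(sample)
    print(f"odd rows mean={high:.0f}, even rows mean={low:.0f}, ratio={ratio:.2f}")
```

A ratio well above 1 would be consistent with every other sub-batch returning far fewer results, but it does not say anything about the cause.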
44) Message boards : Number crunching : Granted Credit taking forever.... (Message 63297)
Posted 12 Sep 2009 by Yifan Song
Post:
The jobs that were canceled are the ones that created the IO problem for the server. DEK and I thought that if we removed the job, it would stop the validator server from processing it. But it turned out it didn't. So we'll have to wait for the server to finish processing the rest of the data.


Are those the ones I saw that were up to 11 MB result files?


Yep. They are large because those are often large protein complexes. Plus we needed to save the full Cartesian coordinates for this system.
As for how credit is handled here, let me check with DEK.
45) Message boards : Number crunching : Granted Credit taking forever.... (Message 63293)
Posted 12 Sep 2009 by Yifan Song
Post:
I actually do still use these data. I just need to figure out a way to make them give me back fewer but better structures in the future.
46) Message boards : Number crunching : Granted Credit taking forever.... (Message 63291)
Posted 11 Sep 2009 by Yifan Song
Post:
The jobs that were canceled are the ones that created the IO problem for the server. DEK and I thought that if we removed the job, it would stop the validator server from processing it. But it turned out it didn't. So we'll have to wait for the server to finish processing the rest of the data.
47) Message boards : Number crunching : Granted Credit taking forever.... (Message 63250)
Posted 10 Sep 2009 by Yifan Song
Post:
Hi guys. Sorry about all the lags. The job I sent was a lot more IO intensive than I had expected.
48) Message boards : Number crunching : Minirosetta 1.95/1.96 (Message 63005)
Posted 22 Aug 2009 by Yifan Song
Post:
I just repackaged the app. I'll be working this weekend to keep an eye on the server status.
49) Message boards : Number crunching : Rosetta Application Version Release Log (Message 63004)
Posted 22 Aug 2009 by Yifan Song
Post:
1.95 is repackaged to 1.96.
I'll be working this weekend to make sure it runs fine.
50) Message boards : Number crunching : Minirosetta 1.90 and 1.91 (Message 62703)
Posted 1 Aug 2009 by Yifan Song
Post:
Thanks! There was a change in that flag and I missed it. That work unit is disabled.

I got a few errors on lr5_combine_mods_run01_rlbn WUs.

http://boinc.bakerlab.org/rosetta/result.php?resultid=269713462
http://boinc.bakerlab.org/rosetta/result.php?resultid=269758962
http://boinc.bakerlab.org/rosetta/result.php?resultid=269787057
http://boinc.bakerlab.org/rosetta/result.php?resultid=269811876

They end after about 10 seconds with the error:

Native pose needed for OptionKeys::relax::constrain_relax_to_native_coords
ERROR:: Exit from: src/protocols/relax/ClassicRelax.cc line: 544
BOINC:: Error reading and gzipping output datafile: default.out

51) Message boards : Number crunching : Minirosetta 1.90 and 1.91 (Message 62685)
Posted 31 Jul 2009 by Yifan Song
Post:
The good news is that 1.90 seems to be stable so far. The network traffic is still heavy, but a lot better than yesterday.

As for Kaspersky, DK contacted the vendor a while ago and it went nowhere. Maybe it's time to try again.
52) Message boards : Number crunching : Rosetta Application Version Release Log (Message 62664)
Posted 30 Jul 2009 by Yifan Song
Post:
1.90 is up
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=5014
53) Message boards : Number crunching : Minirosetta 1.90 and 1.91 (Message 62660)
Posted 30 Jul 2009 by Yifan Song
Post:
All services were temporarily shut down to add more web servers.
54) Message boards : Number crunching : Chaos in Rosetta@Home??? (Message 62655)
Posted 30 Jul 2009 by Yifan Song
Post:
Well, for about two days, it was a little chaotic here. A bunch of us were running around trying to figure out where the bug was and how to fix it.
But be assured that the chaos is not in the science part and hopefully just temporary. I'll post something more detailed on what we've tried to solve this problem.
55) Message boards : Number crunching : Minirosetta 1.90 and 1.91 (Message 62652)
Posted 30 Jul 2009 by Yifan Song
Post:
This version should solve the slowdown and instant-quitting problems.

New protocol added for predictions of changes in protein stability by mutations.

-----
DEK posted an explanation on what caused all the trouble in the last week:
http://boinc.bakerlab.org/rosetta/forum_thread.php?id=5011&nowrap=true#62640

Here are some more details on what was done so far:
1. The signature problem that initially caused the massive network traffic was solved on Monday.
2. The slowdown of the program was caused by two large changes in the code. One of the changes allows Rosetta to model large, symmetric molecules (oligomers http://en.wikipedia.org/wiki/Oligomer). The other allows modeling atomic interactions with higher definition. The bug introduced in the first change was fixed. The second change is temporarily reverted until we can further evaluate its computational cost.

Now unfortunately, due to the signature error and the update of the program, the web server will be extremely busy for the next few days. So downloading/uploading errors are still expected.
56) Message boards : Number crunching : Minirosetta 1.82/1.88 (Message 62648)
Posted 30 Jul 2009 by Yifan Song
Post:
Sorry about these errors.
A lot of the errors (especially the disulfide ones) are due to the jobs submitted a while ago that need the new flags from the latest update. And after the reversion, those jobs are no longer valid.
Our file server is a bit hammered right now, so it's hard to change the status of those jobs as well.
We have found the bug that causes the slowing down of the code though, and we have tested the bug fix on our alpha server. Thus we are updating the program.
Massive network traffic is still expected due to the version update. Hopefully everything will be stabilized in a few days.
57) Message boards : Number crunching : v 187 very low credit (Message 62576)
Posted 29 Jul 2009 by Yifan Song
Post:
This might be a bug. We're working on another update.
Sorry about the trouble.
58) Message boards : Number crunching : Rosetta Application Version Release Log (Message 61887)
Posted 22 Jun 2009 by Yifan Song
Post:
Rosetta 1.80 is up.
59) Message boards : Number crunching : Problems with Minirosetta 1.80 (Message 61886)
Posted 22 Jun 2009 by Yifan Song
Post:
In this version:
New protein-protein docking protocol.
New rotamer library.
60) Message boards : Number crunching : Problems with Minirosetta 1.76 (Message 61791)
Posted 16 Jun 2009 by Yifan Song
Post:
This is a minor update to fix the problems with validation.




