Problem with tycoon database?? #108

Open
tdlong opened this issue May 9, 2018 · 12 comments

Comments

@tdlong

tdlong commented May 9, 2018

I am having a weird problem with progressive Cactus, although it has normally worked for me in its current configuration. I submit a job (aligning human, mouse, rat, and a Peromyscus mouse) and it runs for several days, then gets locked in some sort of death loop. Once it fails, it just keeps failing. The problems in the log file seem to start here. Does this error message give you guys any idea what I am doing wrong? It is difficult for me to interpret the error (there is over 500G of memory).

Got message from job at time: 1525843963.25 : Starting reference phase target with index 0 at 1525843963.2 seconds (recursing = 1)
Got message from job at time: 1525843973.33 : Blocking on ktserver <kyoto_tycoon database_dir="/share/adl/tdlong/peromyscus/Progressive/PCwork/progressiveAlignment/Anc0/Anc0/Anc0_DB_tempSecondaryDatabaseDir_0.695478896174" database_name="Anc0.kch" in_memory="1" port="2078" snapshot="0" />
with killPath /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/gTD1/tmp_6GL2ml6EOK/tmp_9EWsl0wYJh_kill.txt
Got message from job at time: 1525844868.99 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 3964997259434639716 total bases: 83317999 total-ends: 1770 total-caps: 9266 max-end-degree: 74 max-adjacency-length: 2489366 total-blocks: 0 total-groups: 1 total-edges: 1442 total-free-ends: 6 total-attached-ends: 1764 total-chains: 0 total-link groups: 0
Got message from job at time: 1525844868.99 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 3964997259434640271 total bases: 75659437 total-ends: 1788 total-caps: 10268 max-end-degree: 133 max-adjacency-length: 18132749 total-blocks: 0 total-groups: 1 total-edges: 1347 total-free-ends: 24 total-attached-ends: 1764 total-chains: 0 total-link groups: 0
Got message from job at time: 1525844868.99 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 3964997259434640067 total bases: 58824370 total-ends: 2787 total-caps: 14716 max-end-degree: 104 max-adjacency-length: 9797885 total-blocks: 0 total-groups: 1 total-edges: 2253 total-free-ends: 63 total-attached-ends: 2724 total-chains: 0 total-link groups: 0
Got message from job at time: 1525844914.87 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 4947485665643181168 total bases: 83317999 total-ends: 124726 total-caps: 1049126 max-end-degree: 74 max-adjacency-length: 1100541 total-blocks: 60543 total-groups: 57688 total-edges: 64313 total-free-ends: 1876 total-attached-ends: 1764 total-chains: 3474 total-link groups: 55437
Got message from job at time: 1525844981.29 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 430515976878946939 total bases: 75659437 total-ends: 105496 total-caps: 977896 max-end-degree: 133 max-adjacency-length: 18090926 total-blocks: 50881 total-groups: 47607 total-edges: 55100 total-free-ends: 1970 total-attached-ends: 1764 total-chains: 3617 total-link groups: 45036
Got message from job at time: 1525845179.69 : Adding an oversize flower for target class <class 'cactus.pipeline.cactus_workflow.CactusReferenceWrapper'> and stats flower name: 430515976878946123 total bases: 58824370 total-ends: 127341 total-caps: 1817786 max-end-degree: 104 max-adjacency-length: 9787512 total-blocks: 60028 total-groups: 50800 total-edges: 70779 total-free-ends: 4561 total-attached-ends: 2724 total-chains: 6932 total-link groups: 45031
Got message from job at time: 1525845270.82 : Adding an oversize flower 4947485665643181168 for target class <class 'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'>
Got message from job at time: 1525845397.24 : Adding an oversize flower 3964997259434639716 for target class <class 'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'>
Got message from job at time: 1525845445.12 : Adding an oversize flower 430515976878946939 for target class <class 'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'>
Got message from job at time: 1525845511.73 : Adding an oversize flower 430515976878946123 for target class <class 'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'>
Got message from job at time: 1525845542.07 : Adding an oversize flower 3964997259434640271 for target class <class 'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'>
Got message from job at time: 1525845620.7 : Adding an oversize flower 3964997259434640067 for target class <class 'cactus.pipeline.cactus_workflow.CactusSetReferenceCoordinatesUpWrapper'>
The job seems to have left a log file, indicating failure: /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t0/t1/t1/t1/job
Reporting file: /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t0/t1/t1/t1/log.txt
log.txt: 9 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 4 1 58969007620880249 66 49 30 260 27 65 30 450 30 30 49 30 30 104 30 27 30 49 68 49 30 30 88 39 30 39 30 49 30 49 30 27 49 30 30 77 27 49 60 30 61 30 163 68 85 215
63 88 213 27 590 49 49 75 72 38 24 38 24 69 159 36 53 53 40 62 30 85 24 40 65 33 24 24 40 24 33 24 38 69 42 138 64 40 88 85 315 27 46 30 27 27 30 30 199 49 68 68 55 71 49 30 42 46 84 49 131 30 79 30 55 30 30 27 49 30 27 49 202 30 50 27 30 152 119 40
127 87 30 30 46 72 52 68 66 106 24 40 56 65 24 24 24 51 76 45 87 265 36 36 36 33 1958 67 71 64 24 24 43 30 58 84 27 56 93 27 87 30 106 49 87 49 304 30 110 30 77 49 30 30 27 58 68 49 49 87 39 68 27 64 27 27 147 49 507 30 88 49 93 52 30 77 30 71 239
27 30 30 49 87 30 66 77 27 144 30 29132660089536400 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 4 1 45035996273704294 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 1
9 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 1
...stuff deleted...
160 97 27 54 421 54 63 54 27 51 54 39 73 116 71 54 73 30 60 30 30 30 30 51 350 27 39 39 30 39 46 30 30 54 27 39 102 63 73 30 51 92 49 97 27 27 30 69 61 94 55 27 54 54 27 54 209 49 73 94 135 27 73 135 108 73 100 70 54 51 379 175 54 63 27 63 27 66 30
54 65 49 39 54 30 49 103 97 106 51 54 30 30 39 567 30 283 71 73 30 49 116 30 70 30 27 30 51 39 27 144 30 54 30 66 41 39 73 94 54 109 27 115 66 68 197 30 30 44 8022036836287588 127 49 30 27 84 30 27 47 112 27 49 74 103 30 148 48 30 30 30 68 97 156 30
85 97 30 54 185 30 49 137 49 54 73 51 58 54 30 52 82 27 73 52 54 30 92 30 54 49 51 30 98 30 27 254 30 30 30 73 68 92 30 30 27 52 178 140 73 27 30 30 49 49 27 27 101 152 30 49 87 27 58 58 30 66 68 27 27 49 49 30 39 30 27 64 27 27 49 68 27 30 39 86 2
7 30 86 68 30 68 30 30 39 27 46 27 30 49 49 68 49 30 30 30 39 39 30 27 30 98 61 77 49 48 49 63 42 27 30 30 68 30 39 27 49 68 42 168 46 30 30 104 30 27 49 61 49 30 69 30 47 77 27 49 30 49 47 30 49 47 49 87 30 157 67 30 46 27 58 87 52 49 30 27 264 49
100 360 49 49 30 30 30 87 46 30 113 27 116 73 71 54 30 54 51 30 73 49 49 27 122 47 55 97 30 27 30 27 63 70 92 39 119 54 71 54 54 51 46 51' exited with non-zero status 128
log.txt: Exiting the slave because of a failed job on host compute-4-43.local
log.txt: Due to failure we are reducing the remaining retry count of job /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t2/t0/t2/t0/job to 0
log.txt: We have set the default memory of the failed job to 34359738368 bytes
Job: /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t2/t0/t2/t0/job is completely failed
The job seems to have left a log file, indicating failure: /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t2/t0/t2/t1/job
Reporting file: /share/adl/tdlong/peromyscus/Progressive/PCwork/jobTree/jobs/t1/t1/t3/t2/t0/t2/t1/log.txt
log.txt: ---JOBTREE SLAVE OUTPUT LOG---
log.txt: Exception: ST_KV_DATABASE_EXCEPTION: Opening connection to host: 10.1.255.117 with error: network error
log.txt: Uncaught exception
log.txt: Traceback (most recent call last):
log.txt: File "/data/apps/progressiveCactus/submodules/jobTree/src/jobTreeSlave.py", line 271, in main
log.txt: defaultMemory=defaultMemory, defaultCpu=defaultCpu, depth=depth)
log.txt: File "/data/apps/progressiveCactus/submodules/jobTree/scriptTree/stack.py", line 153, in execute
log.txt: self.target.run()
log.txt: File "/data/apps/progressiveCactus/submodules/cactus/pipeline/cactus_workflow.py", line 804, in run
log.txt: bottomUpPhase=True)
log.txt: File "/data/apps/progressiveCactus/submodules/cactus/shared/common.py", line 382, in runCactusAddReferenceCoordinates
log.txt: popenPush(command, stdinString=flowerNames)
log.txt: File "/data/apps/progressiveCactus/submodules/sonLib/bioio.py", line 224, in popenPush
log.txt: raise RuntimeError("Command: %s with stdin string '%s' exited with non-zero status %i" % (command, stdinString, sts))
log.txt: RuntimeError: Command: cactus_addReferenceCoordinates --cactusDisk '<st_kv_database_conf type="kyoto_tycoon">
log.txt: <kyoto_tycoon database_dir="/share/adl/tdlong/peromyscus/Progressive/PCwork/progressiveAlignment/Anc0/Anc0/Anc0_DB" database_name="Anc0.kch" host="10.1.255.117" in_memory="1" port="1978" snapshot="0" />
log.txt: </st_kv_database_conf>
log.txt: ' --secondaryDisk '<st_kv_database_conf type="kyoto_tycoon">
log.txt: <kyoto_tycoon database_dir="/share/adl/tdlong/peromyscus/Progressive/PCwork/progressiveAlignment/Anc0/Anc0/Anc0_DB_tempSecondaryDatabaseDir_0.695478896174" database_name="Anc0.kch" host="10.1.255.117" in_memory="1" port="2078" snapshot="0" />
log.txt: </st_kv_database_conf>' --logLevel CRITICAL --referenceEventString Anc0 --bottomUpPhase with stdin string '8712 8994814355765721555 79 30 49 63 73 49 54 73 54 73 51 55 30 68 75 54 116 54 92 92 140 54 94 63 51 73 54 92 54 94 30 27
46 27 53 65 30 87 27 27 77 49 27 113 71 174 12666373951930501 30 49 71 58 30 30 56 116 137 41 73 30 30 27 92 30 97 39 54 30 27 101 116 30 30 54 73 30 54 54 82 30 104 41 30 42 27 30 73 42 30 100 46 93 54 30 150 97 97 54 109 70 94 49 130 49 97 49 39 1
49 30 30 39 49 41 49 49 30 49 30 30 49 30 27 30 65 30 54 39 49 46 73 120 65 49 30 54 27 93 27 63 27 134 47 30 49 39 30 77 27 77 54 92 30 30 92 30 52 49 71 30 46 27 73 64 51 94 190 97 54 54 54 51 97 30 73 30 71 30 73 97 39 97 144 66 46 49 30 30 30 46
212 73 80 54 80 49 30 30 49 71 39 54 93 54 49 54 54 49 30 71 134 30 54 30 49 30 49 67 73 160 54 30 27 73 73 54 73 30 30 2354 49 51 30 87 27 39 27 42 30 30 46 135 104 97 63 30 30 30 51 30 92 116 118 95 54 73 39 68 92 63 196 46 49 30 27 30 183 30 30
30 30 55 27 49 71 39 129 30 30 73 30 30 63 41 27 41 54 54 7740561859526669 92 73 159 116 54 54 30 63 49 30 159 30 49 122 55 27 30 49 30 171 54 178 69 73 127 30 30 30 46 96 49 85 54 54 51 116 30 39 106 30 73 55 30 68 84 93 159 97 41 103 111 27 49 30
99 30 98 73 39 27 27 183 27 240 49 54 52 39 39 135 60 30 30 30 27 73 30 27 30 27 30 65 97 70 54 116 27 194 30 30 39 126 54 73 114 30 52 106 30 51 51 73 54 68 80 54 68 30 116 39 51 63 218 54 27 30 106 94 27 30 30 106 30 132 54 39 30 228 46 131 27 55
30 30 48 49 30 47 51 71 178 97 27 51 142 54 74 154 30 94 73 30 30 30 30 30 49 30 30 130 73 30 670 30 27 54 51 30 65 27 116 30 30 27 49 46 49 85 47 27 30 54 54 54 63 74 92 27 39 30 30 121 54 54 53 73 70 54 73 107 49 147 49 27 30 30 94 49 30 42 30 94

@joelarmstrong
Collaborator

joelarmstrong commented May 9, 2018

Hmm, it sounds like the database crashed, or at least became inaccessible. That could be for any number of reasons, but if it's not just some random fluke, it most likely crashed because it ran out of memory. Once the database is gone, the run definitely enters the sort of "death loop" you're talking about. The only way out is to stop and restart from the last subproblem, with the --restart flag. That should work if the database crash was caused by some fluke, but if it's a memory issue it would just crash again after a few days.
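
One quick way to tell whether the database is really gone is to try opening a TCP connection to the ktserver address from a compute node. The sketch below is only a generic reachability check, not part of progressiveCactus; the function name `ktserver_reachable` is made up for illustration, and the host and ports in the note after it are the ones shown in the traceback above.

```python
import socket


def ktserver_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    This only tests that something is listening on the port; it does not
    speak the Kyoto Tycoon protocol or check the database contents.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For this run, calling `ktserver_reachable("10.1.255.117", 1978)` and `ktserver_reachable("10.1.255.117", 2078)` would probe the primary and secondary databases. A False result is consistent with a crashed or unreachable ktserver, but it does not say why it went away.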

Can you share the contents of these files (database logs) if they exist?:

/share/adl/tdlong/peromyscus/Progressive/PCwork/progressiveAlignment/Anc0/Anc0/Anc0_DB/ktout.log

/share/adl/tdlong/peromyscus/Progressive/PCwork/progressiveAlignment/Anc0/Anc0/Anc0_DB_tempSecondaryDatabaseDir_0.695478896174/ktout.log
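
If the logs are large, the last few dozen lines are usually the interesting part. A minimal sketch for pulling them out, assuming the files are plain text; the helper names `tail` and `show_tails` are illustrative and not part of Cactus or Kyoto Tycoon.

```python
import os
from collections import deque


def tail(path, n=50):
    """Return the last n lines of a text file, or None if it does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, errors="replace") as handle:
        # A bounded deque keeps only the final n lines in memory.
        return list(deque(handle, maxlen=n))


def show_tails(paths, n=50):
    """Print the last n lines of each file, noting any that are missing."""
    for path in paths:
        print("==> %s <==" % path)
        lines = tail(path, n)
        if lines is None:
            print("(file not found)")
        else:
            print("".join(lines), end="")
```

Passing the two `ktout.log` paths listed above to `show_tails` prints whatever each database wrote just before it stopped.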

@tdlong
Author

tdlong commented May 9, 2018 via email

@tdlong
Author

tdlong commented May 9, 2018 via email

@tdlong
Author

tdlong commented May 9, 2018 via email

@joelarmstrong
Collaborator

Oops, sorry about that. I was thinking of the newer toil syntax. progressiveCactus should restart just fine without the --restart option.

@tdlong
Author

tdlong commented May 11, 2018 via email

@tdlong
Author

tdlong commented May 12, 2018 via email

@joelarmstrong
Collaborator

joelarmstrong commented May 12, 2018

Hmm, weird. The N50 of the assemblies shouldn't really have much of an effect on the memory usage. Are all the assemblies soft-masked? Unmasked assemblies can cause really high memory usage because of the number of alignments they generate, though that usually causes problems at an earlier stage.
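
A rough way to check masking is to count lowercase bases, since soft-masking marks repeats by writing them in lowercase. The sketch below assumes an uncompressed FASTA file; the function name `count_softmasked` is made up for illustration and is not a Cactus utility.

```python
def count_softmasked(fasta_path):
    """Return (lowercase_bases, total_bases) for an uncompressed FASTA file."""
    lower = 0
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip sequence header lines
            seq = line.strip()
            total += len(seq)
            lower += sum(1 for base in seq if base.islower())
    return lower, total
```

A lowercase count at or near zero for one of the assemblies would suggest it was never soft-masked. What fraction counts as "enough" depends on the genome, so treat the number as a sanity check rather than a threshold.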

@tdlong
Author

tdlong commented May 12, 2018 via email

@joelarmstrong
Collaborator

Yes, that sounds great. Thanks for helping to track this down!

@tdlong
Author

tdlong commented May 15, 2018 via email

@tdlong
Author

tdlong commented May 19, 2018 via email
