merge traces before adding to waveforms_contiguoustrace table? #91

megies · 2018-08-23T12:56:49Z

In real life, we sometimes encounter very unusual data.

On our production server, 96% of the entries in waveforms_contiguoustrace come from only 0.5% of all indexed files. The "worst" file contains 73228 gaps/overlaps. Worse still, after a cleanup merge (which takes 20 minutes) only a single trace is left, i.e. the file contains nothing but fragmented and duplicated data (this sometimes happens when there are glitches in real-time streaming or archiving). For such files, the current approach of storing one row per individual trace in the file has severe drawbacks.

An alternative would be to perform some kind of merge after reading a waveform file but before adding data to the waveforms_contiguoustrace table.
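To illustrate why merging before insertion shrinks the table, here is a minimal pure-Python sketch. It is not ObsPy's merge implementation: it only looks at sample index ranges, whereas a real cleanup merge also has to check sampling rates and that overlapping samples are identical. The function name `cleanup_merge`, the tuple layout and the trace id are made up for this example.

```python
from collections import defaultdict


def cleanup_merge(segments):
    """Collapse adjacent, overlapping or duplicated segments per trace id.

    `segments` is an iterable of (trace_id, start, end) tuples, with
    start/end given as sample indices (end exclusive). Hypothetical
    stand-in for a cleanup merge; sample values are NOT compared here.
    """
    by_id = defaultdict(list)
    for trace_id, start, end in segments:
        by_id[trace_id].append((start, end))

    merged = []
    for trace_id in sorted(by_id):
        spans = sorted(by_id[trace_id])
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Adjacent or overlapping: extend the current segment.
                cur_end = max(cur_end, end)
            else:
                # Real gap: close the current segment, start a new one.
                merged.append((trace_id, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((trace_id, cur_start, cur_end))
    return merged


# Illustrative input: four fragments as they might come out of one file.
fragments = [
    ("BW.RJOB..EHZ", 0, 100),
    ("BW.RJOB..EHZ", 100, 200),  # directly adjacent to the first one
    ("BW.RJOB..EHZ", 50, 150),   # duplicated data
    ("BW.RJOB..EHZ", 300, 400),  # separated by a real gap
]
rows = cleanup_merge(fragments)
print(len(fragments), "fragments ->", len(rows), "rows")
```

With one row per fragment the table would get four rows for this file; after the merge only the two segments separated by a real gap remain.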

Pros

- much less DB bloat in the presence of files with overlaps or duplicated data
- therefore likely much faster responses for some jane requests
- much less space used when e.g. backing up DB contents

Cons

- indexing would in some cases take considerably longer: for the worst file I've ever seen, just reading takes 1 minute and the cleanup merge afterwards takes 20 minutes (but indexing is a one-off process, so this might be acceptable)
- the pos field in waveforms_contiguoustrace (i.e. the number of the respective trace after doing an obspy.read(...)) would to some extent lose (or at least change) its meaning, but I do not see that it is ever actually used in the current code base and I cannot even imagine a real use case for it
- the npts field would also be affected by such a change, but again it's probably of little value
- the preview_trace field would be affected as well, but the above Pros would likely outweigh losing some sanity in the preview traces

megies · 2018-08-31T12:39:17Z

In the background, I've started checking the waveform files that have the most individual traces in our jane instance.

Basically, all the very, very bad files that bloat the database just contain duplicated and highly fragmented data. So I think our DB size (currently 100GB) could probably be reduced by 70-90% by doing a cleanup merge while indexing waveform files.

Two possible strategies:

A) Keep `pos` field (in table `waveforms_contiguoustrace`)

- add stream.merge(-1) during indexing
- keep pos field in table layout
- no other changes needed
- old indexed data in waveforms_contiguoustrace ends up with potentially wrong information in the pos column

B) Drop `pos` field (in table `waveforms_contiguoustrace`)

- add stream.merge(-1) during indexing
- drop pos field in table layout
- table layout changes
- no inconsistent data in current state of waveforms_contiguoustrace

I'm in favor of B) as I do not see that pos field is ever used anywhere.

Other opinions?

megies · 2018-08-31T14:10:25Z

Hmm, actually it might be good to keep the pos field, otherwise we run into #84 again in case of two traces in a file that have different data in them but the same stats (although this is a really unlikely and strange case).

So that would mean A), and the question is how badly we would be hurt by the fact that the pos data in the database would not be accurate anymore. I checked, and the only piece of code that reads the pos data and uses it in some way is this piece of code during indexing:

src/jane/waveforms/process_waveforms.py

        # Get all existing traces.
        for tr_db in models.ContinuousTrace.objects.filter(file=file):
            # Attempt to get the existing trace object.
            if tr_db.pos in traces_in_file:
                tr = traces_in_file[tr_db.pos]
                # Delete in the dictionary.
                del traces_in_file[tr_db.pos]

                tr_db.timerange = DateTimeTZRange(
                    lower=tr["starttime"].datetime,
                    upper=tr["endtime"].datetime)
                tr_db.network = tr["network"]
                tr_db.station = tr["station"]
                tr_db.location = tr["location"]
                tr_db.channel = tr["channel"]
                tr_db.sampling_rate = tr["sampling_rate"]
                tr_db.npts = tr["npts"]
                tr_db.duration = tr["duration"]
                tr_db.quality = tr["quality"]
                tr_db.preview_trace = tr["preview_trace"]
                tr_db.pos = tr["pos"]
                tr_db.save()

            # If it does not exist in the waveform file, delete it here as
            # it is (for whatever reason) no longer in the file..
            else:
                tr_db.delete()

So to me it looks like:

- we don't get problems with inaccurate pos entries from old indexing runs, since pos is only ever used in this one place (which simply reuses existing rows in the DB), and
- I've checked in a sandbox jane: it doesn't get confused if the file suddenly has fewer traces in it; it correctly deletes any old left-over traces with higher pos values
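The reuse-or-delete behaviour described above can be mimicked without a database. The following is only a sketch under assumptions: plain dicts stand in for the Django rows, `reconcile` is a hypothetical name, and the final step (creating rows for traces that had no existing row) is assumed rather than shown in the excerpt.

```python
def reconcile(db_rows, traces_in_file):
    """Mimic the indexing loop: reuse rows by pos, delete left-overs.

    `db_rows` maps pos -> row dict (stand-in for the rows of one file),
    `traces_in_file` maps pos -> metadata dict of the freshly read file.
    Returns (updated, deleted, created) lists of pos values.
    """
    remaining = dict(traces_in_file)
    updated, deleted = [], []
    for pos in sorted(db_rows):
        if pos in remaining:
            # Existing row with matching pos: overwrite its metadata.
            db_rows[pos].update(remaining.pop(pos))
            updated.append(pos)
        else:
            # No trace with this pos in the file anymore: drop the row.
            del db_rows[pos]
            deleted.append(pos)
    # Assumption: traces without an existing row get a new row.
    created = sorted(remaining)
    for pos in created:
        db_rows[pos] = dict(remaining[pos])
    return updated, deleted, created


# Five rows from an old indexing run of a fragmented file ...
db_rows = {pos: {"pos": pos, "npts": 100} for pos in range(5)}
# ... and a single trace left after the cleanup merge.
traces_in_file = {0: {"pos": 0, "npts": 500}}

updated, deleted, created = reconcile(db_rows, traces_in_file)
print(updated, deleted, created)
```

In this sketch the row with pos 0 is reused and the four rows with higher pos values are deleted, which matches what the sandbox test showed.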

So I think the only thing left to do is to add this one line and manually kick off the indexers on those very gappy files to get rid of lots of unneeded rows in the production DB.

diff --git a/src/jane/waveforms/process_waveforms.py b/src/jane/waveforms/process_waveforms.py
index 0a4beea..ef00c7c 100644
--- a/src/jane/waveforms/process_waveforms.py
+++ b/src/jane/waveforms/process_waveforms.py
@@ -57,6 +58,7 @@ def process_file(filename):
             file.delete()
         # Reraise the exception.
         raise
+    stream.merge(-1)
 
     if len(stream) == 0:
         msg = "'%s' is a valid waveform file but contains no actual data"
