merge traces before adding to waveforms_contiguoustrace table? #91

Open
megies opened this issue Aug 23, 2018 · 2 comments

megies commented Aug 23, 2018

In real life, we sometimes encounter very unusual data.

On our production server, 96% of entries in waveforms_contiguoustrace come from only 0.5% of all indexed files. The "worst" file sports an insane 73228 gaps/overlaps. What's even worse is that after a cleanup merge (which takes 20 minutes) it has only a single trace left, i.e. the file contains nothing but fragmented and duplicated data (this sometimes happens when there are glitches in real-time streaming / archiving). For such files, the current approach of storing one row per individual trace in the file obviously has severe drawbacks.

An alternative would be to perform some kind of merge after reading a waveform file but before adding data to the waveforms_contiguoustrace table.

Pros

  • much less bloating of DB in presence of files with overlaps or duplicated data
  • therefore likely much faster responses to some jane requests
  • obviously much less space used when e.g. backing up DB contents

Cons

  • indexing would in some cases take considerably longer: for the worst file I've ever seen, just reading takes 1 minute and the cleanup merge afterwards takes 20 minutes (but indexing is a one-off process, so this might be acceptable)
  • the pos field in waveforms_contiguoustrace (i.e. the index of the respective trace after an obspy.read(...)) would to some extent lose (or at least change) its meaning, but I do not see it being used anywhere in the current code base and I cannot even imagine a real use case for it
  • the npts field would also be affected by such a change, but again it's probably of little value
  • the preview_trace field would be affected as well, but the above Pros would likely outweigh losing some sanity in the preview traces
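
The effect on row counts can be illustrated with a toy model in plain Python. This is only a sketch and not ObsPy's `Stream.merge(-1)` implementation: segments are reduced to hypothetical `(trace_id, start, end)` tuples with inclusive integer sample indices, sample values are ignored entirely (a real cleanup merge would also have to compare the data in overlapping parts), and both the function name `collapse_segments` and the trace id `XX.STA..HHZ` are made up for illustration.

```python
from collections import defaultdict


def collapse_segments(segments):
    """Collapse duplicated, overlapping, or directly adjacent segments.

    Each segment is (trace_id, start, end) with inclusive integer sample
    indices. Returns a sorted list with one entry per contiguous run.
    """
    by_id = defaultdict(list)
    for trace_id, start, end in segments:
        by_id[trace_id].append((start, end))

    merged = []
    for trace_id in sorted(by_id):
        runs = []
        for start, end in sorted(by_id[trace_id]):
            # Extend the current run if this segment overlaps it or starts
            # at the very next sample; otherwise a real gap starts a new run.
            if runs and start <= runs[-1][1] + 1:
                runs[-1][1] = max(runs[-1][1], end)
            else:
                runs.append([start, end])
        merged.extend((trace_id, s, e) for s, e in runs)
    return merged


# A fragmented file: short pieces, one duplicate, and one real gap.
fragments = [
    ("XX.STA..HHZ", 0, 99),
    ("XX.STA..HHZ", 50, 149),   # overlaps the first piece
    ("XX.STA..HHZ", 50, 149),   # exact duplicate
    ("XX.STA..HHZ", 150, 199),  # directly adjacent to the previous piece
    ("XX.STA..HHZ", 300, 399),  # real gap before this one
]
rows = collapse_segments(fragments)
print(len(fragments), "->", len(rows))  # 5 -> 2
```

In this model, five stored rows shrink to two: one for the contiguous run and one for the piece after the real gap. Real gaps still produce separate rows, so only fragmentation and duplication are removed.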

megies commented Aug 31, 2018

I've started checking, in the background, those waveform files that have the most individual traces in our jane instance.

[Image: check_bad_files]

Basically, all the very, very bad files that bloat the database actually just have duplicated and highly fragmented data in them. So I think our DB size (currently 100 GB) could probably be reduced by 70-90% by doing a cleanup merge while indexing waveform files.

Two possible strategies:

A) Keep pos field (in table waveforms_contiguoustrace)

  • add stream.merge(-1) during indexing
  • keep pos field in table layout
  • no other changes needed
  • old indexed data in waveforms_contiguoustrace ends up with potentially wrong information in the pos column

B) Drop pos field (in table waveforms_contiguoustrace)

  • add stream.merge(-1) during indexing
  • drop pos field in table layout
  • table layout changes
  • no inconsistent data in current state of waveforms_contiguoustrace

I'm in favor of B) as I do not see that pos field is ever used anywhere.

Other opinions?


megies commented Aug 31, 2018

Hmm, actually it might be good to keep the pos field, otherwise we run into #84 again in case of two traces in a file that have different data in them but the same stats (although this is a really unlikely and strange case).

So that would mean A), and the question is how badly we would get hurt by the fact that the pos data in the database would not be accurate anymore. I checked, and the only code that reads the pos data and uses it in some way is this part of the indexing:

src/jane/waveforms/process_waveforms.py

        # Get all existing traces.
        for tr_db in models.ContinuousTrace.objects.filter(file=file):
            # Attempt to get the existing trace object.
            if tr_db.pos in traces_in_file:
                tr = traces_in_file[tr_db.pos]
                # Delete in the dictionary.
                del traces_in_file[tr_db.pos]

                tr_db.timerange = DateTimeTZRange(
                    lower=tr["starttime"].datetime,
                    upper=tr["endtime"].datetime)
                tr_db.network = tr["network"]
                tr_db.station = tr["station"]
                tr_db.location = tr["location"]
                tr_db.channel = tr["channel"]
                tr_db.sampling_rate = tr["sampling_rate"]
                tr_db.npts = tr["npts"]
                tr_db.duration = tr["duration"]
                tr_db.quality = tr["quality"]
                tr_db.preview_trace = tr["preview_trace"]
                tr_db.pos = tr["pos"]
                tr_db.save()

            # If it does not exist in the waveform file, delete it here as
            # it is (for whatever reason) no longer in the file..
            else:
                tr_db.delete()

So to me it looks like..

  • we don't get problems with inaccurate pos field entries from old indexing runs, since pos is only ever used in this one place (which simply reuses existing rows in the DB), and..
  • I've checked in a sandbox jane: it doesn't get confused if the file suddenly has fewer traces in it, and it correctly deletes any old left-over traces with higher pos values
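
The behaviour described in these two points can be mimicked without Django. The sketch below is a stand-in, not jane's code: database rows and freshly read traces are plain dicts keyed by `pos`, and `sync_rows` is a hypothetical name. It mirrors the two branches of the loop quoted above (overwrite a row whose `pos` still exists, delete a row whose `pos` is gone). What jane does with traces left over in `traces_in_file` is not shown in the excerpt, so the sketch simply returns them.

```python
def sync_rows(db_rows, traces_in_file):
    """Reconcile stored rows with the traces currently found in a file.

    db_rows:        {pos: row_dict} as previously indexed
    traces_in_file: {pos: trace_dict} from the latest read (and merge)
    Returns (kept_rows, deleted_positions, unmatched_traces).
    """
    remaining = dict(traces_in_file)  # do not mutate the caller's dict
    kept, deleted = {}, []
    for pos, row in sorted(db_rows.items()):
        if pos in remaining:
            # Reuse the existing row and overwrite it with fresh values.
            kept[pos] = {**row, **remaining.pop(pos)}
        else:
            # The trace is no longer in the file, so its row is dropped.
            deleted.append(pos)
    return kept, deleted, remaining


# Before: five fragments were indexed. After a merge only one trace is left.
old_rows = {pos: {"npts": 100, "pos": pos} for pos in range(5)}
new_traces = {0: {"npts": 500, "pos": 0}}

kept, deleted, unmatched = sync_rows(old_rows, new_traces)
print(kept)       # {0: {'npts': 500, 'pos': 0}}
print(deleted)    # [1, 2, 3, 4]
print(unmatched)  # {}
```

The row at pos 0 is reused with the merged trace's values, and the stale rows at the higher pos values are removed, which matches what the sandbox test showed.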

So I think the only thing there is to do is add this one line and manually kick off indexers on those very gappy files to get rid of lots of unneeded rows in production DB.

diff --git a/src/jane/waveforms/process_waveforms.py b/src/jane/waveforms/process_waveforms.py
index 0a4beea..ef00c7c 100644
--- a/src/jane/waveforms/process_waveforms.py
+++ b/src/jane/waveforms/process_waveforms.py
@@ -57,6 +58,7 @@ def process_file(filename):
             file.delete()
         # Reraise the exception.
         raise
+    stream.merge(-1)
 
     if len(stream) == 0:
         msg = "'%s' is a valid waveform file but contains no actual data"
