merge traces before adding to waveforms_contiguoustrace table? #91
I've started checking, in the background, those waveform files that have the most individual traces in our Jane instance. Basically, all of the very, very bad files that bloat the database actually just contain duplicated and highly fragmented data. So I think our DB size (currently 100 GB) could probably be reduced by 70-90% by doing a cleanup merge while indexing waveform files. Two possible strategies: A) Keep
Hmm, actually it might be good to keep So that would mean A) and the question is how bad we would get hurt by the fact that the
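To illustrate why a cleanup merge would collapse the row count so dramatically, here is a pure-Python sketch of the idea (the segment list is made up, and unlike ObsPy's actual `Stream.merge(method=-1)`, which only merges traces whose overlapping samples are identical, this toy version merges any overlapping or directly adjoining sample ranges):

```python
def cleanup_merge(segments):
    """Merge (start, end) sample ranges that duplicate, overlap, or
    directly continue each other -- a toy stand-in for a cleanup merge
    on a single channel."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlapping or contiguous: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# A fragmented, partially duplicated file: many traces, little real data.
segments = [(0, 100), (50, 100), (100, 200), (0, 100), (150, 300)]
print(len(segments), "->", len(cleanup_merge(segments)))  # 5 -> 1
```

Five raw traces collapse into one contiguous range, which is exactly the pattern described above: one DB row per fragment before the merge, one row for the whole channel afterwards.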
```python
# Get all existing traces.
for tr_db in models.ContinuousTrace.objects.filter(file=file):
    # Attempt to get the existing trace object.
    if tr_db.pos in traces_in_file:
        tr = traces_in_file[tr_db.pos]
        # Delete in the dictionary.
        del traces_in_file[tr_db.pos]

        tr_db.timerange = DateTimeTZRange(
            lower=tr["starttime"].datetime,
            upper=tr["endtime"].datetime)
        tr_db.network = tr["network"]
        tr_db.station = tr["station"]
        tr_db.location = tr["location"]
        tr_db.channel = tr["channel"]
        tr_db.sampling_rate = tr["sampling_rate"]
        tr_db.npts = tr["npts"]
        tr_db.duration = tr["duration"]
        tr_db.quality = tr["quality"]
        tr_db.preview_trace = tr["preview_trace"]
        tr_db.pos = tr["pos"]
        tr_db.save()

    # If it does not exist in the waveform file, delete it here as
    # it is (for whatever reason) no longer in the file..
    else:
        tr_db.delete()
```

So to me it looks like..
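The update-or-delete logic in that snippet can be sketched without Django (plain dicts stand in for the ORM rows and the parsed traces; the function and variable names here are illustrative, not from the code base):

```python
def sync_rows(db_rows, traces_in_file):
    """Update DB rows whose position still exists in the file, delete
    stale ones, and report which file traces are new -- mirroring the
    loop in process_waveforms.py."""
    updated, deleted = [], []
    remaining = dict(traces_in_file)
    for pos, row in list(db_rows.items()):
        if pos in remaining:
            row.update(remaining.pop(pos))  # overwrite with fresh metadata
            updated.append(pos)
        else:
            del db_rows[pos]                # no longer in the file
            deleted.append(pos)
    return updated, deleted, remaining      # `remaining` would be inserted

db_rows = {0: {"npts": 10}, 1: {"npts": 20}}
traces = {0: {"npts": 12}, 2: {"npts": 5}}
print(sync_rows(db_rows, traces))  # ([0], [1], {2: {'npts': 5}})
```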
So I think the only thing there is to do is add this one line and manually kick off the indexer on those very gappy files to get rid of lots of unneeded rows in the production DB.

```diff
diff --git a/src/jane/waveforms/process_waveforms.py b/src/jane/waveforms/process_waveforms.py
index 0a4beea..ef00c7c 100644
--- a/src/jane/waveforms/process_waveforms.py
+++ b/src/jane/waveforms/process_waveforms.py
@@ -57,6 +58,7 @@ def process_file(filename):
         file.delete()
         # Reraise the exception.
         raise
+    stream.merge(-1)
     if len(stream) == 0:
         msg = "'%s' is a valid waveform file but contains no actual data"
```
In real life, we sometimes encounter very unusual data. On our production server, 96% of the entries in `waveforms_contiguoustrace` come from only 0.5% of all indexed files. The "worst" file sports an insane 73228 gaps/overlaps. What's even worse: after a cleanup merge (which takes 20 minutes) it only has a single trace left, i.e. the file contains nothing but fragmentation and duplication of data (this sometimes happens when there are glitches in real-time streaming/archiving). For such files, the current approach of storing one row per individual trace in the file obviously has severe drawbacks.

An alternative would be to perform some kind of merge after reading a waveform file, but before adding data to the `waveforms_contiguoustrace` table.

Pros

Cons

- The `pos` field in `waveforms_contiguoustrace` (i.e. the number of the respective trace after doing an `obspy.read(...)`) would to some extent lose (or at least change) its meaning, but I do not see that it is actually ever used in the current code base, and I cannot even imagine a real use case for it.
- The `npts` field would also be affected by such a change, but again it is probably of little value.
- The `preview_trace` field would be affected as well, but the above Pros would likely outweigh losing some sanity in the preview traces.
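For a concrete (entirely made-up) illustration of the `pos`/`npts` concern: after a merge, rows no longer map one-to-one onto the raw traces in the file, so both fields describe merged traces instead.

```python
# Hypothetical (pos, npts) rows for one gappy channel as indexed today:
rows_before = list(enumerate([512, 512, 256, 512]))  # one row per raw trace

# After a cleanup merge the fragments collapse into a single contiguous
# trace, so only one row remains and `pos`/`npts` refer to the merged
# trace. (The merged length 1024 is an assumed value, not computed here.)
rows_after = [(0, 1024)]

print(len(rows_before), "rows ->", len(rows_after), "row")  # 4 rows -> 1 row
```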