
Add support for filterbank files? #84

Open · wfarah opened this issue Mar 26, 2023 · 17 comments

wfarah commented Mar 26, 2023

The current output of the ATA is files in filterbank (.fil) format.
Is there scope for adding support for filterbank files as input to hyperseti?

texadactyl (Contributor) commented:

@wfarah Not using rawspec to generate .h5 files directly?

telegraphic (Contributor) commented Mar 26, 2023

@wfarah you want

from hyperseti.io import from_fil
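
e.g., a minimal usage sketch (the filename here is hypothetical):

ds = from_fil("observation.fil")   # returns a DataArray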

telegraphic (Contributor) commented:

(I'll update docs!)

telegraphic (Contributor) commented:

You can also use load_data:

# Names this function relies on (import paths approximate; see hyperseti/io):
import h5py
import setigen as stg
from blimpy.io import sigproc
from hyperseti.data_array import DataArray
from hyperseti.io import from_h5, from_fil, from_setigen

def load_data(filename: str) -> DataArray:
    """ Load file and return data array
    Args:
        filename (str): Name of file to load. 
    
    Returns:
        ds (DataArray): DataArray object
    
    Notes:
        This method also supports input as a setigen Frame,
        or as an existing DataArray.
    """
    if isinstance(filename, DataArray):
        ds = filename           # User has actually supplied a DataArray
    elif isinstance(filename, str):
        if h5py.is_hdf5(filename):
            ds = from_h5(filename)
        elif sigproc.is_filterbank(filename):
            ds = from_fil(filename)
        else:
            raise RuntimeError("Only HDF5/filterbank files or DataArray/setigen.Frame objects are currently supported")
    elif isinstance(filename, stg.Frame):
        ds = from_setigen(filename)
    else:
        raise RuntimeError("Only HDF5/filterbank files or DataArray/setigen.Frame objects are currently supported")
    return ds
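
So passing a filename to load_data dispatches on file type automatically; a minimal sketch (the filename is hypothetical):

from hyperseti.io import load_data

ds = load_data("observation.fil")   # a .fil path is detected and routed to from_fil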

telegraphic (Contributor) commented:

I should also document how to use the from_metadata function. Planning on adding from_bifrost too, which will take a bifrost _tensor dictionary and generate a DataArray.

In terms of working with filterbank files, there's not too much to note, except that it's different from blimpy: the data are opened with np.memmap, i.e. memory-mapped rather than read into RAM by default. Low bitwidths (<8 bits) are not supported for this reason.
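
The memory-mapping idea in a minimal sketch (the header length and filename here are hypothetical; hyperseti parses the real SIGPROC header for the data offset, dtype and shape):

import numpy as np

hdr_len = 512   # hypothetical; read from the SIGPROC header in practice
data = np.memmap("observation.fil", dtype="float32", mode="r", offset=hdr_len)
chunk = np.array(data[:1024])   # only the pages actually touched are read from disk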

You can see the source code here; it's refreshingly short:
https://github.com/UCBerkeleySETI/hyperseti/blob/master/hyperseti/io/filterbank.py

The sigproc header-parsing module (from blimpy.io import sigproc) is copied over from blimpy.

Let me know how you go!

telegraphic (Contributor) commented:

PS: Here's a script I'm running on blpc0 to see how it fares on multiple files (random old data from the Parkes GP survey). I've set a super-high drift rate (100 Hz/s), I'm using apply_smearing_corr = True to account for Doppler smearing, and n_blank = 8 so it searches for low-S/N signals after blanking higher-S/N signals:

import glob, os, pprint
from hyperseti import find_et
from hyperseti.io import load_config
from hyperseti.io.hit_db import HitDatabase

config_file = 'pksmb.yaml'              # Pipeline config file to load
hit_db      = 'pks_mb.hitdb'            # Output HitDatabase name
fpath       = '/datag/collate_mb/PKS_0240_2017-12-21T19:00/'
hires_ext   = '*.0000.hires.hdf'        # Extension to use to ID hi-freq res data
gpu_id      = 3                         # GPU to attach to (blpc nodes can be busy!)

# Load config and glob folder list
blc_folders = sorted(glob.glob(os.path.join(fpath, 'blc*')))
config = load_config(config_file)
pprint.pprint(config)

# Create new hit database
db = HitDatabase(hit_db, mode='w')

# Loop through blcXX folders, and search for hires H5 files
# (these data are older)
for ff, folder in enumerate(blc_folders):
    filelist = glob.glob(os.path.join(folder, hires_ext))
    for ii, filename in enumerate(filelist):
        print(f"(node {ff} / {len(blc_folders)}: file {ii + 1}/{len(filelist)}) Opening {filename}...")
        
        # A slightly more friendly observation ID, prepending blcXX to ensure ID is unique
        blc_id = os.path.basename(folder)
        obs_id = os.path.basename(filename)
        obs_id = obs_id.replace('guppi_', '').replace(hires_ext.lstrip('*'), '')  # strip the glob '*' so the suffix actually matches
        obs_id = f'{blc_id}_{obs_id}'

        # Search for hits, file by file. Do not save to file here as we save to HitDatabase
        hit_browser = find_et(filename, config, 
                        gulp_size=2**20,
                        gpu_id=gpu_id,   # use the gpu_id variable defined above
                        filename_out=None, log_output=False, log_config=False
                        )

        # Save to the HitDatabase 
        print(f"Saving to obs_id: {obs_id}")
        hit_browser.to_db(db, obs_id=obs_id)

And the config file:

preprocess:
  blank_edges:
    n_chan: 1024
  normalize: true
  sk_flag: 
    n_sigma: 10
  blank_extrema: 
    threshold: 10000
  poly_fit: 5
dedoppler:
  apply_smearing_corr: true
  kernel: ddsk
  max_dd: 100.0
  min_dd: null
  plan: stepped
hitsearch:
  min_fdistance: null
  threshold: 20
pipeline:
  merge_boxcar_trials: true
  n_boxcar: 1
  n_blank: 8

wfarah (Author) commented Mar 27, 2023

We have a real-time beamformer/upchannelizer that we usually use, and it currently writes out .fil files. Ultimately we would like to write .h5 files, as that appears to be more practical for many reasons.

In the meantime, I think we can use the above example, so thanks for providing it!

mirosaide commented:

I have tried to run the following:

from IPython.display import display
import time
from hyperseti import find_et
from hyperseti.io import load_data

start_time = time.time()

path = load_data("fil_60019_75483_194942626_alf_Aql_0001-beam0000.fil")

config = {
    'preprocess': {
        'sk_flag': True,                        # Apply spectral kurtosis flagging
        'normalize': True,                      # Normalize data
        'blank_edges': {'n_chan': 32},          # Blank edges channels
        'blank_extrema': {'threshold': 10000}   # Blank ridiculously bright signals before search
    },
    'dedoppler': {
        'kernel': 'ddsk',                       # Doppler + kurtosis doppler (ddsk)
        'max_dd': 10.0,                         # Maximum dedoppler delay, 10 Hz/s
        'apply_smearing_corr': True,            # Correct  for smearing within dedoppler kernel
        'plan': 'stepped'                       # Dedoppler trial spacing plan (stepped = less memory)
    },
    'hitsearch': {
        'threshold': 20,                        # SNR threshold above which to consider a hit
    },
    'pipeline': {
        'merge_boxcar_trials': True             # Merge hits at same frequency that are found in multiple boxcars
    }
}

hit_browser = find_et(path, config, gulp_size=2**20)
display(hit_browser.hit_table)

hit_browser.view_hit(0, padding=128, plot='dual')

end_time = time.time()

runtime = (end_time - start_time)/60

print(f"Runtime: {runtime:.2f} `minutes") ```


Then I got the following error:

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 9,500,098,560 bytes (allocated so far: 3,196,059,648 bytes).

telegraphic (Contributor) commented Apr 8, 2023

@mirosaide you may have guessed it, but you ran out of memory on the GPU!

You'll need to either 1) free up memory on the GPU if other things are running, 2) narrow your doppler drift range (maybe try +/- 1 Hz/s and see if it fits?), or 3) choose a smaller gulp size (2**18).
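
For example, applied to the snippet above (a sketch of options 2 and 3):

config['dedoppler']['max_dd'] = 1.0                   # option 2: narrower drift range (+/- 1 Hz/s)
hit_browser = find_et(path, config, gulp_size=2**18)  # option 3: smaller gulp size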

BTW, iterative blanking is faster now, and I would suggest using it if you're searching out to +/- 10 Hz/s. This config goes into config['pipeline']:

    'pipeline': {
        'merge_boxcar_trials': True,
        'blank_hits': {'n_blank': 4, 'padding': 16}
    }

mirosaide commented:

I tried again with:

from hyperseti.pipeline import find_et
from IPython.display import display
import time
from hyperseti.io import load_data



start_time = time.time()

path = load_data("fil_60031_05757_253968139_WISE_J223617.59+510551._0001-beam0000.fil")


config = {
    'preprocess': {
        'sk_flag': True,                        # Apply spectral kurtosis flagging
        'normalize': True,                      # Normalize data
        'blank_edges': {'n_chan': 32},          # Blank edges channels
        'blank_extrema': {'threshold': 10000}   # Blank ridiculously bright signals before search
    },
    'dedoppler': {
        'kernel': 'ddsk',                       # Doppler + kurtosis doppler (ddsk)
        'max_dd': 1.0,                          # Maximum dedoppler delay, 1 Hz/s
        'apply_smearing_corr': True,            # Correct for smearing within dedoppler kernel
        'plan': 'stepped'                       # Dedoppler trial spacing plan (stepped = less memory)
    },
    'hitsearch': {
        'threshold': 20,                        # SNR threshold above which to consider a hit
    },
    'pipeline': {
        'merge_boxcar_trials': True,                 # Merge hits at same frequency that are found in multiple boxcars
        'blank_hits': {'n_blank': 4, 'padding': 16}  # Iteratively blank hits, then re-search
    }
}

hit_browser = find_et(path, config, gulp_size=2**18)
display(hit_browser.hit_table)

hit_browser.view_hit(0, padding=16, plot='dual')

end_time = time.time()

runtime = (end_time - start_time)/60

print(f"Runtime: {runtime:.2f} minutes")

Now getting:

[find_et] find_et: hyperseti version 0.8.0
[find_et] Progress 1/192
[hyperseti.normalize] High flagged fraction: 2.000
[hyperseti.normalize] Too much data flagged: 2.000
[find_et] 	Preprocess mean:       [1. 1.]
[find_et] 	Preprocess STD:        [0. 0.]
[find_et] 	Preprocess flagged:    200.00%
Traceback (most recent call last):
  File "/mnt/buf0/msaide/test_hyperseti.py", line 35, in <module>
    hit_browser = find_et(path, config, gulp_size=2**18)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/utils.py", line 47, in wrapper
    res = func(*arg, **kwargs)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/pipeline.py", line 332, in find_et
    hits = pipeline.run()
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/utils.py", line 47, in wrapper
    res = func(*arg, **kwargs)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/pipeline.py", line 234, in run
    self.hitsearch()
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/utils.py", line 47, in wrapper
    res = func(*arg, **kwargs)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/pipeline.py", line 181, in hitsearch
    _peaks = hitsearch(self.dedopp, **conf['hitsearch'])
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/hits.py", line 227, in hitsearch
    intensity, fcoords, dcoords = peak_find(imgdata, threshold=threshold, min_spacing=min_fdistance)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/kernels/peak_finder.py", line 293, in peak_find
    find_max_1D(dedopp_data, maxval_gpu, maxidx_gpu)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/kernels/peak_finder.py", line 98, in find_max_1D
    N_time, N_chan = d_gpu.shape
ValueError: too many values to unpack (expected 2)

telegraphic (Contributor) commented:

Very impressive that it somehow managed to flag 200% of the data! (I'm not sure how that's possible.)

My guess is that the data do not have Gaussian statistics, so it's all getting flagged. A preprocess STD of 0.0 will convert all the data into NaN values, and then it's game over downstream.

You may be able to continue by setting 'sk_flag': False. Otherwise, I can look into it if you can copy the data over to the blpc nodes at the data center.
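
For example, applied to the config dict above:

config['preprocess']['sk_flag'] = False   # disable spectral kurtosis flagging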

telegraphic (Contributor) commented:

  • Or it thinks the data do not have Gaussian stats (could be a metadata issue, or it's misinterpreting something).

mirosaide commented:

I set 'sk_flag': False but am still getting a ValueError:

ValueError: too many values to unpack (expected 2)

telegraphic (Contributor) commented:

I'd have to take a look to debug this one. Can you either send me the file (e.g. upload to blpd) or let me know where you're working? (I don't have login access to the ATA, though.)

mirosaide commented:

Hello,
You can find one sample of the data here: https://breakthroughlisten.slack.com/archives/D04D5LF0XPV/p1681758944399619

texadactyl (Contributor) commented:

@mirosaide That URL has a glitch.
