
Add support for filterbank files? #84

Open · wfarah opened this issue Mar 26, 2023 · 17 comments

wfarah commented Mar 26, 2023

The current output of the ATA is files in filterbank (.fil) format.
Is there scope for adding support for filterbank files as input to hyperseti?

texadactyl (Contributor) commented:

@wfarah Not using rawspec to generate .h5 files directly?

telegraphic (Contributor) commented Mar 26, 2023

@wfarah you want

from hyperseti.io import from_fil
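
e.g., a minimal usage sketch (the filename here is hypothetical):

ds = from_fil("observation.fil")   # returns a DataArray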

telegraphic (Contributor) commented:

(I'll update docs!)

telegraphic (Contributor) commented:

You can also use load_data:

# Names this function relies on (import paths approximate; see hyperseti/io):
import h5py
import setigen as stg
from blimpy.io import sigproc
from hyperseti.data_array import DataArray
from hyperseti.io import from_h5, from_fil, from_setigen

def load_data(filename: str) -> DataArray:
    """ Load file and return data array
    Args:
        filename (str): Name of file to load. 
    
    Returns:
        ds (DataArray): DataArray object
    
    Notes:
        This method also supports input as a setigen Frame,
        or as an existing DataArray.
    """
    if isinstance(filename, DataArray):
        ds = filename           # User has actually supplied a DataArray
    elif isinstance(filename, str):
        if h5py.is_hdf5(filename):
            ds = from_h5(filename)
        elif sigproc.is_filterbank(filename):
            ds = from_fil(filename)
        else:
            raise RuntimeError("Only HDF5/filterbank files or DataArray/setigen.Frame objects are currently supported")
    elif isinstance(filename, stg.Frame):
        ds = from_setigen(filename)
    else:
        raise RuntimeError("Only HDF5/filterbank files or DataArray/setigen.Frame objects are currently supported")
    return ds
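
So passing a filename to load_data dispatches on file type automatically; a minimal sketch (the filename is hypothetical):

from hyperseti.io import load_data

ds = load_data("observation.fil")   # a .fil path is detected and routed to from_fil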

telegraphic (Contributor) commented:

I should also document how to use the from_metadata function. Planning on adding from_bifrost too, which will take a bifrost _tensor dictionary and generate a DataArray.

In terms of working with filterbank files, there's not too much to note, except that it's different from blimpy: the data are opened with np.memmap, i.e. memory-mapped rather than read into RAM by default. Low bitwidths (<8 bits) are not supported for this reason.
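
The memory-mapping idea in a minimal sketch (the header length and filename here are hypothetical; hyperseti parses the real SIGPROC header for the data offset, dtype and shape):

import numpy as np

hdr_len = 512   # hypothetical; read from the SIGPROC header in practice
data = np.memmap("observation.fil", dtype="float32", mode="r", offset=hdr_len)
chunk = np.array(data[:1024])   # only the pages actually touched are read from disk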

You can see the source code here; it's refreshingly short:
https://github.com/UCBerkeleySETI/hyperseti/blob/master/hyperseti/io/filterbank.py

The sigproc header-parsing module (from blimpy.io import sigproc) is copied over from blimpy.

Let me know how you go!

telegraphic (Contributor) commented:

PS: Here's a script I'm running on blpc0 to see how it fares on multiple files (random old data from the Parkes GP survey). I've set a super-high drift rate (100 Hz/s), I'm using apply_smearing_corr = True to account for Doppler smearing, and n_blank = 8 so it searches for low-S/N signals after blanking higher-S/N signals:

import glob, os, pprint
from hyperseti import find_et
from hyperseti.io import load_config
from hyperseti.io.hit_db import HitDatabase

config_file = 'pksmb.yaml'              # Pipeline config file to load
hit_db      = 'pks_mb.hitdb'            # Output HitDatabase name
fpath       = '/datag/collate_mb/PKS_0240_2017-12-21T19:00/'
hires_ext   = '*.0000.hires.hdf'        # Extension to use to ID hi-freq res data
gpu_id      = 3                         # GPU to attach to (blpc nodes can be busy!)

# Load config and glob folder list
blc_folders = sorted(glob.glob(os.path.join(fpath, 'blc*')))
config = load_config(config_file)
pprint.pprint(config)

# Create new hit database
db = HitDatabase(hit_db, mode='w')

# Loop through blcXX folders, and search for hires H5 files
# (these data are older)
for ff, folder in enumerate(blc_folders):
    filelist = glob.glob(os.path.join(folder, hires_ext))
    for ii, filename in enumerate(filelist):
        print(f"(node {ff} / {len(blc_folders)}: file {ii + 1}/{len(filelist)}) Opening {filename}...")
        
        # A slightly more friendly observation ID, prepending blcXX to ensure ID is unique
        blc_id = os.path.basename(folder)
        obs_id = os.path.basename(filename)
        obs_id = obs_id.replace('guppi_', '').replace(hires_ext.lstrip('*'), '')  # strip the glob '*' so the suffix actually matches
        obs_id = f'{blc_id}_{obs_id}'

        # Search for hits, file by file. Do not save to file here as we save to HitDatabase
        hit_browser = find_et(filename, config, 
                        gulp_size=2**20,
                        gpu_id=gpu_id,   # use the gpu_id variable defined above
                        filename_out=None, log_output=False, log_config=False
                        )

        # Save to the HitDatabase 
        print(f"Saving to obs_id: {obs_id}")
        hit_browser.to_db(db, obs_id=obs_id)

And the config file:

preprocess:
  blank_edges:
    n_chan: 1024
  normalize: true
  sk_flag: 
    n_sigma: 10
  blank_extrema: 
    threshold: 10000
  poly_fit: 5
dedoppler:
  apply_smearing_corr: true
  kernel: ddsk
  max_dd: 100.0
  min_dd: null
  plan: stepped
hitsearch:
  min_fdistance: null
  threshold: 20
pipeline:
  merge_boxcar_trials: true
  n_boxcar: 1
  n_blank: 8

wfarah (Author) commented Mar 27, 2023

We have a real-time beamformer/upchannelizer that we usually use, and it currently writes out .fil files. Ultimately we would like to write .h5 files, as that appears to be more practical for many reasons.

In the meantime, I think we can use the above example, so thanks for providing it!

mirosaide commented:

I have tried to run the following:

from IPython.display import display
import time
from hyperseti import find_et
from hyperseti.io import load_data

start_time = time.time()

path = load_data("fil_60019_75483_194942626_alf_Aql_0001-beam0000.fil")

config = {
    'preprocess': {
        'sk_flag': True,                        # Apply spectral kurtosis flagging
        'normalize': True,                      # Normalize data
        'blank_edges': {'n_chan': 32},          # Blank edges channels
        'blank_extrema': {'threshold': 10000}   # Blank ridiculously bright signals before search
    },
    'dedoppler': {
        'kernel': 'ddsk',                       # Doppler + kurtosis doppler (ddsk)
        'max_dd': 10.0,                         # Maximum dedoppler delay, 10 Hz/s
        'apply_smearing_corr': True,            # Correct  for smearing within dedoppler kernel
        'plan': 'stepped'                       # Dedoppler trial spacing plan (stepped = less memory)
    },
    'hitsearch': {
        'threshold': 20,                        # SNR threshold above which to consider a hit
    },
    'pipeline': {
        'merge_boxcar_trials': True             # Merge hits at same frequency that are found in multiple boxcars
    }
}

hit_browser = find_et(path, config, gulp_size=2**20)
display(hit_browser.hit_table)

hit_browser.view_hit(0, padding=128, plot='dual')

end_time = time.time()

runtime = (end_time - start_time)/60

print(f"Runtime: {runtime:.2f} `minutes") ```


Then I got the following error:

cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 9,500,098,560 bytes (allocated so far: 3,196,059,648 bytes).

telegraphic (Contributor) commented Apr 8, 2023

@mirosaide you may have guessed it, but you ran out of memory on the GPU!

You'll need to either 1) free up memory on the GPU if other things are running, 2) narrow your doppler drift range (maybe try +/- 1 Hz/s and see if it fits?), or 3) choose a smaller gulp size (2**18).
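
For example, applied to the snippet above (a sketch of options 2 and 3):

config['dedoppler']['max_dd'] = 1.0                   # option 2: narrower drift range (+/- 1 Hz/s)
hit_browser = find_et(path, config, gulp_size=2**18)  # option 3: smaller gulp size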

BTW, iterative blanking is faster now, and I would suggest using it if you're searching out to +/- 10 Hz/s. This config goes into config['pipeline']:

    'pipeline': {
        'merge_boxcar_trials': True,
        'blank_hits': {'n_blank': 4, 'padding': 16}
    }

mirosaide commented:

I tried again with:

from hyperseti.pipeline import find_et
from IPython.display import display
import time
from hyperseti.io import load_data



start_time = time.time()

path = load_data("fil_60031_05757_253968139_WISE_J223617.59+510551._0001-beam0000.fil")


config = {
    'preprocess': {
        'sk_flag': True,                        # Apply spectral kurtosis flagging
        'normalize': True,                      # Normalize data
        'blank_edges': {'n_chan': 32},          # Blank edges channels
        'blank_extrema': {'threshold': 10000}   # Blank ridiculously bright signals before search
    },
    'dedoppler': {
        'kernel': 'ddsk',                       # Doppler + kurtosis doppler (ddsk)
        'max_dd': 1.0,                          # Maximum dedoppler delay, 1 Hz/s
        'apply_smearing_corr': True,            # Correct for smearing within dedoppler kernel
        'plan': 'stepped'                       # Dedoppler trial spacing plan (stepped = less memory)
    },
    'hitsearch': {
        'threshold': 20,                        # SNR threshold above which to consider a hit
    },
    'pipeline': {
        'merge_boxcar_trials': True,                 # Merge hits at same frequency that are found in multiple boxcars
        'blank_hits': {'n_blank': 4, 'padding': 16}  # Iteratively blank hits, then re-search
    }
}

hit_browser = find_et(path, config, gulp_size=2**18)
display(hit_browser.hit_table)

hit_browser.view_hit(0, padding=16, plot='dual')

end_time = time.time()

runtime = (end_time - start_time)/60

print(f"Runtime: {runtime:.2f} minutes")

Now getting:

[find_et] find_et: hyperseti version 0.8.0
[find_et] Progress 1/192
[hyperseti.normalize] High flagged fraction: 2.000
[hyperseti.normalize] Too much data flagged: 2.000
[find_et] 	Preprocess mean:       [1. 1.]
[find_et] 	Preprocess STD:        [0. 0.]
[find_et] 	Preprocess flagged:    200.00%
Traceback (most recent call last):
  File "/mnt/buf0/msaide/test_hyperseti.py", line 35, in <module>
    hit_browser = find_et(path, config, gulp_size=2**18)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/utils.py", line 47, in wrapper
    res = func(*arg, **kwargs)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/pipeline.py", line 332, in find_et
    hits = pipeline.run()
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/utils.py", line 47, in wrapper
    res = func(*arg, **kwargs)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/pipeline.py", line 234, in run
    self.hitsearch()
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/utils.py", line 47, in wrapper
    res = func(*arg, **kwargs)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/pipeline.py", line 181, in hitsearch
    _peaks = hitsearch(self.dedopp, **conf['hitsearch'])
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/hits.py", line 227, in hitsearch
    intensity, fcoords, dcoords = peak_find(imgdata, threshold=threshold, min_spacing=min_fdistance)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/kernels/peak_finder.py", line 293, in peak_find
    find_max_1D(dedopp_data, maxval_gpu, maxidx_gpu)
  File "/opt/mnt/miniconda3/envs/hyperseti/lib/python3.10/site-packages/hyperseti/kernels/peak_finder.py", line 98, in find_max_1D
    N_time, N_chan = d_gpu.shape
ValueError: too many values to unpack (expected 2)

telegraphic (Contributor) commented:

Very impressive that it somehow managed to flag 200% of the data! (I'm not sure how that's possible.)

My guess is that the data do not have Gaussian statistics, so it's all getting flagged. A preprocess STD of 0.0 will convert all the data into NaN values, and then it's game over downstream.

You may be able to continue by setting 'sk_flag': False. Otherwise, I can look into it if you can copy the data over to the blpc nodes at the data center.
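
For example, applied to the config dict above:

config['preprocess']['sk_flag'] = False   # disable spectral kurtosis flagging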

telegraphic (Contributor) commented:

  • Or it thinks the data do not have Gaussian stats (could be a metadata issue, or it's misinterpreting something).

mirosaide commented:

I set 'sk_flag': False but am still getting a ValueError:

ValueError: too many values to unpack (expected 2)

telegraphic (Contributor) commented:

I'd have to take a look to debug this one. Can you either send me the file (e.g. upload to blpd) or let me know where you're working? (I don't have login access to the ATA, though.)

mirosaide commented:

Hello,
You can find one sample of the data here: https://breakthroughlisten.slack.com/archives/D04D5LF0XPV/p1681758944399619

texadactyl (Contributor) commented:

@mirosaide That URL has a glitch.
