
Support feature selection #338

Open
amaya382 opened this issue Sep 8, 2016 · 6 comments

amaya382 (Contributor) commented Sep 8, 2016

Feature selection

Feature selection is the process of selecting a subset of influential features from a larger set of features. It is an important technique for improving results, shortening training time, and making features human-understandable.

The following is the current, tentative interface (I/F).

Candidates for internal selecting methods

  • chi2 (for non-negative data only)
  • SNR
  • mRMR

Common

[UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>

Input

| array X | array Y |
| --- | --- |
| a row of matrix | a row of matrix |

Output

| array<array> dotted |
| --- |
| dot(X.T, Y), shape = (X.#cols, Y.#cols) |
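Since dot(X.T, Y) decomposes into a sum of per-row outer products, the aggregate can be accumulated one row pair at a time, which is what makes a UDAF implementation possible (see the incremental sketch later in this thread):

$$(X^\top Y)_{ij} = \sum_{k} X_{ki}\, Y_{kj}$$

i.e., each incoming row pair (x, y) contributes the outer product x yᵀ to the running sum.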

[UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>

Input

| array X | array importance_list | int k |
| --- | --- | --- |
| a row of matrix | the larger, the more important | top-k |

Output

| array<double> k-best elements |
| --- |
| the top-k elements of X, selected via the indices of the k largest values in importance_list |

Note

  • The current implementation expects that _all rows share the same importance_list and k_. This may be confusing.
    • Future workaround (WA): add an option indicating that a common importance_list and k are used.
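To illustrate the intended semantics, here is a minimal Python sketch (a hypothetical stand-in, not the actual UDF; whether the selected elements keep their original order is an assumption):

```python
import numpy as np

def select_k_best(X, importance_list, k):
    """Keep the elements of X whose indices rank in the top-k of importance_list."""
    top_k = set(np.argsort(importance_list)[::-1][:k])  # indices of the k largest importances
    return [x for i, x in enumerate(X) if i in top_k]   # assumed to preserve original order

# e.g., keep the 2 features with the largest chi2 scores
print(select_k_best([5.1, 3.5, 1.4, 0.2], [10.8, 3.6, 116.2, 67.2], 2))
# -> [1.4, 0.2] (f2 and f3 rank highest)
```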

chi2

[UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>

Input

Both observed and expected have shape = (#classes, #features).

| array observed | array expected |
| --- | --- |
| observed features | expected features, dot(class_prob.T, feature_count) |

Output

| struct<array, array> importance lists |
| --- |
| chi2 values and p-values for each feature, each with shape = (1, #features) |
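For reference, the per-feature statistic is the standard one-way chi-squared test over classes:

$$\chi^2_j = \sum_{c=1}^{C} \frac{(O_{cj} - E_{cj})^2}{E_{cj}}$$

with the p-value taken from the upper tail of the $\chi^2$ distribution with $C - 1$ degrees of freedom, where $C$ is the number of classes. This is also what scikit-learn's chi2 computes, which is used for validation later in this thread.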

Example - chi2

CREATE TABLE input (
  X array<double>, -- features
  Y array<int> -- binarized label
);

WITH stats AS (
  SELECT
    -- [UDAF] transpose_and_dot(Y::array<number>, X::array<number>)::array<array<double>>
    transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features)
    array_sum(X) AS feature_count, -- array<double>, shape = (1, n_features)
    array_avg(Y) AS class_prob -- array<double>, shape = (1, n_classes)
  FROM
    input
),
test AS (
  SELECT
    transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_class, n_features)
  FROM
    stats
),
chi2 AS (
  SELECT
    -- [UDF] chi2(observed::array<array<double>>, expected::array<array<double>>)::struct<array<double>, array<double>>
    chi2(observed, expected) AS chi2s -- struct<array<double>, array<double>>, each shape = (1, n_features)
  FROM
    test CROSS JOIN stats
)
SELECT
  -- [UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>
  select_k_best(X, chi2s.chi2, 2) -- top-2 feature selection based on chi2 score
FROM
  input CROSS JOIN chi2;
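For clarity, a minimal NumPy sketch of what the query above computes, assuming X is the stacked (n_samples, n_features) feature matrix and Y the stacked (n_samples, n_classes) one-hot label matrix (function names here are illustrative):

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

def chi2_scores(X, Y):
    observed = Y.T @ X                       # transpose_and_dot(Y, X): (n_classes, n_features)
    feature_count = X.sum(axis=0)            # array_sum(X): (n_features,)
    class_prob = Y.mean(axis=0)              # array_avg(Y): (n_classes,)
    expected = np.outer(class_prob, feature_count)  # transpose_and_dot(class_prob, feature_count)
    stats = ((observed - expected) ** 2 / expected).sum(axis=0)
    pvals = chi2_dist.sf(stats, df=Y.shape[1] - 1)  # upper-tail p-value, C - 1 degrees of freedom
    return stats, pvals
```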

SNR

[UDAF] snr(X::array<number>, Y::array<int>)::array<double>

Input

| array X | array Y |
| --- | --- |
| a row of matrix, overall shape = (#samples, #features) | a row of one-hot matrix, overall shape = (#samples, #classes) |

Output

| array importance list |
| --- |
| SNR values of each feature, shape = (1, #features) |

Note

  • One-hot vectorizing Y is not strictly necessary, but doing so keeps this interface consistent with chi2's.
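For reference, the usual two-class definition of the signal-to-noise ratio of feature j is

$$\mathrm{SNR}_j = \frac{\lvert \mu_j^{(1)} - \mu_j^{(2)} \rvert}{\sigma_j^{(1)} + \sigma_j^{(2)}}$$

where $\mu_j^{(c)}$ and $\sigma_j^{(c)}$ are the mean and standard deviation of feature j within class c; how the per-class-pair values are aggregated in the multi-class case is an implementation detail.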

Example - snr

CREATE TABLE input (
  X array<double>, -- features
  Y array<int> -- binarized label
);

WITH snr AS (
  -- [UDAF] snr(features::array<number>, labels::array<int>)::array<double>
  SELECT snr(X, Y) AS snr FROM input -- aggregated SNR as array<double>, shape = (1, #features)
)
SELECT select_k_best(X, snr, ${k}) FROM input CROSS JOIN snr;
myui added this to the v0.4 milestone Sep 8, 2016
myui self-assigned this Sep 8, 2016
myui (Owner) commented Sep 8, 2016

@amaya382 It depends on the input format of each algorithm. If every algorithm can use the same input format, it would be better to provide a single feature_selection function with an algorithm option.

Also, how to apply distributed processing is another issue. mRMR is processed in parallel via MapReduce in some implementations.

myui (Owner) commented Sep 15, 2016

@amaya382 we can do better as follows:

X::relation(array<double>)  # n_features * n_examples matrix
Y::relation(array<int>)     # n_class * n_examples matrix

observed = dotp(Y.T, X)     # n_classes * n_features matrix
UDAF transpose_and_dot(Y, X)::array<array<double>>

feature_count = X.sum       # n_features col vector
UDAF array_sum(X)::array<double>

class_prob = Y.mean         # n_class col vector
UDAF array_avg(Y)::array<double>

expected = dotp(class_prob.T, feature_count)    # n_class * n_features matrix
UDAF transpose_and_dot(class_prob, feature_count)::array<array<double>>

chi2,pval = chisquare(observed, expected)   # n_class * n_features => n_features
UDAF chi2(observed::array<double>, expected::array<double>)::struct<array<double>,array<double>>

chi2    # n_features col vector
pval    # n_features col vector
create table input (
  X array<double>, -- features
  Y array<int> -- binarized label 
);

WITH stats AS (
  select
    -- UDAF transpose_and_dot(Y::array<double>, X::array<double>)::array<array<double>>
    transpose_and_dot(Y, X) as observed, -- array<array<double>> # n_classes * n_features matrix
    array_sum(X) as feature_count, -- n_features col vector # array<double>
    array_avg(Y) as class_prob -- n_class col vector # array<double>
  from
    input
),
test as (
  select
    observed, 
    -- UDAF transpose_and_dot(class_prob::array<double>, feature_count::array<double>)::array<array<double>>
    transpose_and_dot(class_prob, feature_count) as expected -- array<array<double>> # n_class * n_features matrix
  from
    stats
),
chi2 as (
  select
    -- UDAF chi2(observed::array<double>, expected::array<double>)::struct<array<double>,array<double>>
    chi2(observed, expected) as (chi2, pval) -- struct<array<double>,array<double>> # n_features
  from
    test
)

select
  select_k_best(X, T.chi2, ${k}) as X, -- feature selection based on chi2 score
  Y
from
  input
  CROSS JOIN chi2 T
;

What you have to develop is just the transpose_and_dot UDAF and the chi2 UDAF.
transpose_and_dot is a bit tricky, but it can be implemented incrementally.
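A minimal Python sketch of that incremental accumulation (the real implementation would be a Hive UDAF in Java; class and method names here just mirror the UDAF lifecycle):

```python
class TransposeAndDot:
    """Accumulates dot(X.T, Y) one row pair at a time, UDAF-style."""

    def __init__(self):
        self.acc = None  # partial-sum matrix of shape (len(x), len(y))

    def iterate(self, x, y):
        # Add this row pair's outer-product contribution.
        if self.acc is None:
            self.acc = [[0.0] * len(y) for _ in x]
        for i, xi in enumerate(x):
            for j, yj in enumerate(y):
                self.acc[i][j] += xi * yj

    def merge(self, other):
        # Combine partial aggregates computed on different nodes.
        if other.acc is None:
            return
        if self.acc is None:
            self.acc = other.acc
            return
        for i, row in enumerate(other.acc):
            for j, v in enumerate(row):
                self.acc[i][j] += v

    def terminate(self):
        return self.acc
```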

amaya382 (Contributor, Author) commented

@myui Okay, I'll try this strategy.

myui (Owner) commented Sep 16, 2016

@amaya382 Could you validate your implementation against another chi2 implementation (e.g., scikit-learn) in the unit and system tests?
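The scikit-learn side of such a comparison could look like this (a sketch, assuming the Iris dataset as in the results below):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

iris = load_iris()
chi2_vals, p_vals = chi2(iris.data, iris.target)  # reference values per feature
print(chi2_vals)
print(p_vals)
```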

amaya382 (Contributor, Author) commented Sep 20, 2016

@myui

chi2 (Iris dataset)

Results from Hivemall via the system test, which actually executes the query:

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| chi2_vals | 10.817820878493995 | 3.5944990176817315 | 116.16984746363957 | 67.24482558215503 |
| p_vals | 0.004476514990225833 | 0.16575416718561453 | 0.0 | 2.55351295663786E-15 |

Results from scikit-learn:

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| chi2_vals | 10.81782088 | 3.59449902 | 116.16984746 | 67.24482759 |
| p_vals | 4.47651499e-03 | 1.65754167e-01 | 5.94344354e-26 | 2.50017968e-15 |

amaya382 (Contributor, Author) commented

@myui

SNR (Iris dataset)

Results from Hivemall (incremental algorithm) via the system test, which actually executes the query:

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| aggregated SNR | 3.2700426804423826 | 1.9034599450433007 | 11.359577906085278 | 9.794847727147216 |

Results from Python/NumPy (batch algorithm):

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| aggregated SNR | 3.27004268 | 1.90345995 | 11.35957791 | 9.79484773 |

Also, this has already been tested on EMR and worked properly.
