
Support feature selection #338

Open
amaya382 opened this issue Sep 8, 2016 · 6 comments

amaya382 (Contributor) commented Sep 8, 2016

Feature selection

Feature selection is the process of selecting a subset of influential features from a larger set of features. It is an important technique for improving results, shortening training time, and making features human-understandable.

The following is the current, tentative interface (I/F).

Candidates for internal selecting methods

  • chi2 (for non-negative data only)
  • SNR
  • mRMR

Common

[UDAF] transpose_and_dot(X::array<number>, Y::array<number>)::array<array<double>>

Input

| array X | array Y |
| --- | --- |
| a row of matrix | a row of matrix |

Output

| array<array> dotted |
| --- |
| dot(X.T, Y), shape = (X.#cols, Y.#cols) |
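Since dot(X.T, Y) decomposes into a sum of per-row outer products, the aggregate can be accumulated one row pair at a time, which is what makes a UDAF implementation possible (see the incremental sketch later in this thread):

$$(X^\top Y)_{ij} = \sum_{k} X_{ki}\, Y_{kj}$$

i.e., each incoming row pair (x, y) contributes the outer product x yᵀ to the running sum.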

[UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>

Input

| array X | array importance_list | int k |
| --- | --- | --- |
| a row of matrix | the larger, the more important | top-k |

Output

| array<double> k-best elements |
| --- |
| the top-k elements of X, selected via the indices of the k largest values in importance_list |

Note

  • The current implementation expects that _all rows share the same importance_list and k_. This may be confusing.
    • Future workaround (WA): add an option indicating that a common importance_list and k are used.
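To illustrate the intended semantics, here is a minimal Python sketch (a hypothetical stand-in, not the actual UDF; whether the selected elements keep their original order is an assumption):

```python
import numpy as np

def select_k_best(X, importance_list, k):
    """Keep the elements of X whose indices rank in the top-k of importance_list."""
    top_k = set(np.argsort(importance_list)[::-1][:k])  # indices of the k largest importances
    return [x for i, x in enumerate(X) if i in top_k]   # assumed to preserve original order

# e.g., keep the 2 features with the largest chi2 scores
print(select_k_best([5.1, 3.5, 1.4, 0.2], [10.8, 3.6, 116.2, 67.2], 2))
# -> [1.4, 0.2] (f2 and f3 rank highest)
```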

chi2

[UDF] chi2(observed::array<array<number>>, expected::array<array<number>>)::struct<array<double>, array<double>>

Input

Both observed and expected have shape = (#classes, #features).

| array observed | array expected |
| --- | --- |
| observed features | expected features, dot(class_prob.T, feature_count) |

Output

| struct<array, array> importance lists |
| --- |
| chi2 values and p-values for each feature, each with shape = (1, #features) |
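For reference, the per-feature statistic is the standard one-way chi-squared test over classes:

$$\chi^2_j = \sum_{c=1}^{C} \frac{(O_{cj} - E_{cj})^2}{E_{cj}}$$

with the p-value taken from the upper tail of the $\chi^2$ distribution with $C - 1$ degrees of freedom, where $C$ is the number of classes. This is also what scikit-learn's chi2 computes, which is used for validation later in this thread.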

Example - chi2

CREATE TABLE input (
  X array<double>, -- features
  Y array<int> -- binarized label
);

WITH stats AS (
  SELECT
    -- [UDAF] transpose_and_dot(Y::array<number>, X::array<number>)::array<array<double>>
    transpose_and_dot(Y, X) AS observed, -- array<array<double>>, shape = (n_classes, n_features)
    array_sum(X) AS feature_count, -- array<double>, shape = (1, n_features)
    array_avg(Y) AS class_prob -- array<double>, shape = (1, n_classes)
  FROM
    input
),
test AS (
  SELECT
    transpose_and_dot(class_prob, feature_count) AS expected -- array<array<double>>, shape = (n_class, n_features)
  FROM
    stats
),
chi2 AS (
  SELECT
    -- [UDF] chi2(observed::array<array<double>>, expected::array<array<double>>)::struct<array<double>, array<double>>
    chi2(observed, expected) AS chi2s -- struct<array<double>, array<double>>, each shape = (1, n_features)
  FROM
    test CROSS JOIN stats
)
SELECT
  -- [UDF] select_k_best(X::array<number>, importance_list::array<number>, k::int)::array<double>
  select_k_best(X, chi2s.chi2, 2) -- top-2 feature selection based on chi2 score
FROM
  input CROSS JOIN chi2;
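For clarity, a minimal NumPy sketch of what the query above computes, assuming X is the stacked (n_samples, n_features) feature matrix and Y the stacked (n_samples, n_classes) one-hot label matrix (function names here are illustrative):

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

def chi2_scores(X, Y):
    observed = Y.T @ X                       # transpose_and_dot(Y, X): (n_classes, n_features)
    feature_count = X.sum(axis=0)            # array_sum(X): (n_features,)
    class_prob = Y.mean(axis=0)              # array_avg(Y): (n_classes,)
    expected = np.outer(class_prob, feature_count)  # transpose_and_dot(class_prob, feature_count)
    stats = ((observed - expected) ** 2 / expected).sum(axis=0)
    pvals = chi2_dist.sf(stats, df=Y.shape[1] - 1)  # upper-tail p-value, C - 1 degrees of freedom
    return stats, pvals
```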

SNR

[UDAF] snr(X::array<number>, Y::array<int>)::array<double>

Input

| array X | array Y |
| --- | --- |
| a row of matrix, overall shape = (#samples, #features) | a row of one-hot matrix, overall shape = (#samples, #classes) |

Output

| array importance list |
| --- |
| SNR values of each feature, shape = (1, #features) |

Note

  • One-hot vectorizing Y is not strictly necessary, but doing so keeps this interface consistent with chi2's.
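For reference, the usual two-class definition of the signal-to-noise ratio of feature j is

$$\mathrm{SNR}_j = \frac{\lvert \mu_j^{(1)} - \mu_j^{(2)} \rvert}{\sigma_j^{(1)} + \sigma_j^{(2)}}$$

where $\mu_j^{(c)}$ and $\sigma_j^{(c)}$ are the mean and standard deviation of feature j within class c; how the per-class-pair values are aggregated in the multi-class case is an implementation detail.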

Example - snr

CREATE TABLE input (
  X array<double>, -- features
  Y array<int> -- binarized label
);

WITH snr AS (
  -- [UDAF] snr(features::array<number>, labels::array<int>)::array<double>
  SELECT snr(X, Y) AS snr FROM input -- aggregated SNR as array<double>, shape = (1, #features)
)
SELECT select_k_best(X, snr, ${k}) FROM input CROSS JOIN snr;
myui added this to the v0.4 milestone Sep 8, 2016
myui self-assigned this Sep 8, 2016
myui (Owner) commented Sep 8, 2016

@amaya382 It depends on the input format of each algorithm. If every algorithm can use the same input format, it would be better to provide a single feature_selection function with an algorithm option.

Also, how to apply distributed processing is another issue. mRMR is processed in parallel via MapReduce in some implementations.

myui (Owner) commented Sep 15, 2016

@amaya382 we can do better as follows:

X::relation(array<double>)  # n_features * n_examples matrix
Y::relation(array<int>)     # n_class * n_examples matrix

observed = dotp(Y.T, X)     # n_classes * n_features matrix
UDAF transpose_and_dot(Y, X)::array<array<double>>

feature_count = X.sum       # n_features col vector
UDAF array_sum(X)::array<double>

class_prob = Y.mean         # n_class col vector
UDAF array_avg(Y)::array<double>

expected = dotp(class_prob.T, feature_count)    # n_class * n_features matrix
UDAF transpose_and_dot(class_prob, feature_count)::array<array<double>>

chi2,pval = chisquare(observed, expected)   # n_class * n_features => n_features
UDAF chi2(observed::array<double>, expected::array<double>)::struct<array<double>,array<double>>

chi2    # n_features col vector
pval    # n_features col vector
create table input (
  X array<double>, -- features
  Y array<int> -- binarized label 
);

WITH stats AS (
  select
    -- UDAF transpose_and_dot(Y::array<double>, X::array<double>)::array<array<double>>
    transpose_and_dot(Y, X) as observed, -- array<array<double>> # n_classes * n_features matrix
    array_sum(X) as feature_count, -- n_features col vector # array<double>
    array_avg(Y) as class_prob -- n_class col vector # array<double>
  from
    input
),
test as (
  select
    observed, 
    -- UDAF transpose_and_dot(class_prob::array<double>, feature_count::array<double>)::array<array<double>>
    transpose_and_dot(class_prob, feature_count) as expected -- array<array<double>> # n_class * n_features matrix
  from
    stats
),
chi2 as (
  select
    -- UDAF chi2(observed::array<double>, expected::array<double>)::struct<array<double>,array<double>>
    chi2(observed, expected) as (chi2, pval) -- struct<array<double>,array<double>> # n_features
  from
    test
)

select
  select_k_best(X, T.chi2, ${k}) as X, -- feature selection based on chi2 score
  Y
from
  input
  CROSS JOIN chi2 T
;

What you have to develop is just the transpose_and_dot UDAF and the chi2 UDAF.
transpose_and_dot is a bit tricky, but it can be implemented incrementally.
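A minimal Python sketch of that incremental accumulation (the real implementation would be a Hive UDAF in Java; class and method names here just mirror the UDAF lifecycle):

```python
class TransposeAndDot:
    """Accumulates dot(X.T, Y) one row pair at a time, UDAF-style."""

    def __init__(self):
        self.acc = None  # partial-sum matrix of shape (len(x), len(y))

    def iterate(self, x, y):
        # Add this row pair's outer-product contribution.
        if self.acc is None:
            self.acc = [[0.0] * len(y) for _ in x]
        for i, xi in enumerate(x):
            for j, yj in enumerate(y):
                self.acc[i][j] += xi * yj

    def merge(self, other):
        # Combine partial aggregates computed on different nodes.
        if other.acc is None:
            return
        if self.acc is None:
            self.acc = other.acc
            return
        for i, row in enumerate(other.acc):
            for j, v in enumerate(row):
                self.acc[i][j] += v

    def terminate(self):
        return self.acc
```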

amaya382 (Contributor, Author) commented

@myui Okay, I'll try this strategy.

myui (Owner) commented Sep 16, 2016

@amaya382 Could you validate your implementation against another chi2 implementation (e.g., scikit-learn) in the unit and system tests?
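The scikit-learn side of such a comparison could look like this (a sketch, assuming the Iris dataset as in the results below):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

iris = load_iris()
chi2_vals, p_vals = chi2(iris.data, iris.target)  # reference values per feature
print(chi2_vals)
print(p_vals)
```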

amaya382 (Contributor, Author) commented Sep 20, 2016

@myui

chi2 (Iris dataset)

Results from Hivemall via the system test, which actually executes the query:

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| chi2_vals | 10.817820878493995 | 3.5944990176817315 | 116.16984746363957 | 67.24482558215503 |
| p_vals | 0.004476514990225833 | 0.16575416718561453 | 0.0 | 2.55351295663786E-15 |

Results from scikit-learn:

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| chi2_vals | 10.81782088 | 3.59449902 | 116.16984746 | 67.24482759 |
| p_vals | 4.47651499e-03 | 1.65754167e-01 | 5.94344354e-26 | 2.50017968e-15 |

amaya382 (Contributor, Author) commented

@myui

SNR (Iris dataset)

Results from Hivemall (incremental algorithm) via the system test, which actually executes the query:

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| aggregated SNR | 3.2700426804423826 | 1.9034599450433007 | 11.359577906085278 | 9.794847727147216 |

Results from Python/NumPy (batch algorithm):

| | f0 | f1 | f2 | f3 |
| --- | --- | --- | --- | --- |
| aggregated SNR | 3.27004268 | 1.90345995 | 11.35957791 | 9.79484773 |

Also, this has already been tested on EMR and worked properly.
