A PostgreSQL foreign data wrapper (FDW) for ORC files on the local file system (not in HDFS). Selecting specific columns and skipping stripes are not supported yet. Based on the Apache ORC C++ library, cstore_fdw, and gokhankici/orc_fdw.

cjqhenry14/server_orc_fdw

Development Resources

This FDW is based on the following code:

  1. [Refer] https://github.com/gokhankici/orc_fdw (gokhankici_orc_fdw): another orc_fdw, but it only supports Hive 0.11.
  2. [Refer] https://github.com/citusdata/cstore_fdw (cstore_fdw): another ORC-style FDW, which may use its own self-defined ORC format.
  3. [Use] https://github.com/apache/orc/tree/master: the Apache ORC C++ library.
  4. [Use] https://github.com/cjqhenry14/localOrcCppLib: my modified copy of the Apache ORC C++ library, changed to support the FDW.
  5. [Refer] http://doxygen.postgresql.org/file__fdw_8c_source.html: file_fdw, the official FDW for reading plain files.

compile & install

./exe

use it

  1. Log in to PostgreSQL:
    psql fdw_db mzhong

  2. Create the extension:
    CREATE EXTENSION orc_fdw;

  3. Create the ORC server:
    CREATE SERVER orc_server FOREIGN DATA WRAPPER orc_fdw;

  4. Create a foreign table for test_data1.orc (converted from test_data1.txt):
    CREATE FOREIGN TABLE test_data1_orc (id INT, name VARCHAR(20), state CHAR(2), salary DOUBLE PRECISION, birthday DATE) SERVER orc_server OPTIONS (filename '/usr/pgsql-9.4/test_data1.orc');

  5. Run the query:
    SELECT * FROM test_data1_orc; -- passes

code introduction

For the server_orc_fdw code:

  1. orcInclude/: the header files from the ORC C++ library.

  2. orcLib/: static library files built from my modified ORC C++ library, https://github.com/cjqhenry14/localOrcCppLib.
    The pre*.a files are unused and kept only as backups.

  3. testDataFile/: simple ORC files for testing.

  4. caller.c: for testing only.

  5. exe: a shell script for compiling and running.

  6. orcLibBridge.*: the bridge between the FDW and the ORC library, connecting the C code with the C++ code.

The parts of the Apache ORC C++ library used by the FDW are described here:
https://github.com/cjqhenry14/localOrcCppLib.
If you modify the ORC library, remember to keep it synchronized with the code and static library files in your FDW.
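
The role of orcLibBridge.* (letting C code call into C++ code) can be illustrated with a minimal sketch. This is not the code from this repository: FakeReader, BridgeReader, bridge_open(), bridge_next() and bridge_close() are hypothetical names, and the reader returns hard-coded rows instead of calling the ORC library. The idea is that the C side only sees an opaque pointer and plain C functions declared with extern "C".

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Stand-in for a C++ reader class (assumption: NOT the real ORC API).
class FakeReader {
public:
    explicit FakeReader(const std::string& path) : path_(path), next_(0) {
        rows_ = {"1|alice", "2|bob", "3|carol"};  // hard-coded demo rows
    }
    // Fills `out` and returns true while rows remain.
    bool next(std::string& out) {
        if (next_ >= rows_.size()) return false;
        out = rows_[next_++];
        return true;
    }
private:
    std::string path_;
    std::vector<std::string> rows_;
    std::size_t next_;
};

extern "C" {

// Opaque handle: C callers never see the C++ class behind it.
typedef struct BridgeReader BridgeReader;

BridgeReader* bridge_open(const char* filename) {
    assert(filename != nullptr);
    return reinterpret_cast<BridgeReader*>(new FakeReader(filename));
}

// Copies the next row into buf as a NUL-terminated string.
// Returns 1 on success, 0 at end of data, -1 if buf is too small
// (the row is dropped in that case; a real bridge needs a better policy).
int bridge_next(BridgeReader* r, char* buf, size_t buflen) {
    assert(r != nullptr && buf != nullptr);
    std::string row;
    if (!reinterpret_cast<FakeReader*>(r)->next(row)) return 0;
    if (row.size() + 1 > buflen) return -1;
    std::memcpy(buf, row.c_str(), row.size() + 1);
    return 1;
}

void bridge_close(BridgeReader* r) {
    delete reinterpret_cast<FakeReader*>(r);
}

}  // extern "C"
```

A C caller would declare the same three functions in a header and loop on bridge_next() until it returns 0.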

current bugs

The FDW currently passes the single-table query test cases in TPC-H, but it fails in other cases.
For example:

  1. select s_suppkey, n_nationkey from supplier, nation limit 100000; -- fails
  2. select s_suppkey, n_nationkey from supplier, nation limit 20000; -- passes

I have tried many times but have not yet found the cause. My guess is that it is a memory-related problem.
PostgreSQL has its own memory management mechanism (memory contexts),
see: http://blog.pgaddict.com/posts/introduction-to-memory-contexts.

The Apache ORC C++ library, however, is written in C++ and might not be compatible with PostgreSQL's C code.
Note that the current code might not use PostgreSQL's memory management API correctly:
several attempted modifications still failed, so I rolled back to an earlier version.

The bug may also have other causes.
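
One general C/C++ compatibility hazard, not confirmed as the cause of this bug, is a C++ exception escaping into C code, which has no way to handle it. A common defensive pattern is to catch everything at the extern "C" boundary and return an error code. The sketch below assumes hypothetical names (read_rows_or_throw(), bridge_read_rows()) and a fake routine in place of the ORC library.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <exception>
#include <new>
#include <stdexcept>

// Hypothetical C++ routine that may throw; stands in for an ORC library call.
static int read_rows_or_throw(int requested) {
    if (requested < 0) throw std::invalid_argument("negative row count");
    if (requested > 1000000) throw std::bad_alloc();  // simulated out-of-memory
    return requested;
}

// C-callable wrapper: no exception may leave this function.
// Returns 0 on success, -1 on failure with a message copied into errbuf.
extern "C" int bridge_read_rows(int requested, int* rows_read,
                                char* errbuf, size_t errlen) {
    assert(rows_read != nullptr && errbuf != nullptr);
    try {
        *rows_read = read_rows_or_throw(requested);
        return 0;
    } catch (const std::exception& e) {
        std::snprintf(errbuf, errlen, "%s", e.what());
    } catch (...) {
        std::snprintf(errbuf, errlen, "unknown C++ exception");
    }
    return -1;
}
```

The C side can then report the message through PostgreSQL's own error mechanism instead of letting the C++ runtime unwind through C frames.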

future work

  1. Select specific columns from the ORC file. This has to be implemented in both the FDW and the ORC C++ library.
    See gokhankici_orc_fdw and cstore_fdw for reference, and study the ORC C++ library's API.

  2. Skip batches.
    See gokhankici_orc_fdw and cstore_fdw for reference, and study the ORC C++ library's API.

  3. Add libhdfs to read ORC files stored in HDFS:
    Replace all read- and open-related functions with libhdfs API calls. To compile with libhdfs,
    see https://github.com/cjqhenry14/myfile_fdw/blob/master/Makefile
    To connect to HDFS, you can use: hdfsFS fs = hdfsConnect("130.245.130.190", 8020);

  4. Refine the other necessary functions and APIs of the FDW, e.g. fileGetForeignPaths().
    See other FDWs for reference, especially gokhankici_orc_fdw and cstore_fdw.

  5. Pass the TPC-H test cases.
    The TPC-H tables are in the 'fdw_orc_tpch_1g' database; log in with 'psql fdw_orc_tpch_1g mzhong'.

suggestions

  1. Get familiar with how an FDW is built and run.
    Start with a simple FDW (file_fdw).

  2. An FDW has several APIs that need to be implemented, which is not easy.
    So start by focusing on BeginForeignScan() and IterateForeignScan().
    BeginForeignScan() mainly prepares the file and the information needed for parsing.
    IterateForeignScan() returns one tuple (record) from the file per call.

  3. Printing or logging for debugging does not currently work.
    It is an access-privilege problem: PostgreSQL cannot write to local files.
    Ask the server admin for help.
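
The split between BeginForeignScan() and IterateForeignScan() from suggestion 2 can be sketched without any PostgreSQL headers. Everything here is hypothetical (Batch, BatchSource, ScanState, begin_scan(), iterate_scan(), end_scan()); it only simulates the pattern of reading data in batches while handing back one row per call, which is roughly what an FDW over a batch-oriented reader has to do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a columnar batch (assumption: not the ORC library's batch type).
struct Batch {
    std::vector<long> ids;
};

// Hypothetical source that produces rows 0..total-1 in fixed-size batches.
class BatchSource {
public:
    BatchSource(long total, std::size_t batch_size)
        : total_(total), next_id_(0), batch_size_(batch_size) {}
    // Refills `b`; returns false when no rows are left.
    bool next_batch(Batch& b) {
        b.ids.clear();
        while (next_id_ < total_ && b.ids.size() < batch_size_) {
            b.ids.push_back(next_id_++);
        }
        return !b.ids.empty();
    }
private:
    long total_;
    long next_id_;
    std::size_t batch_size_;
};

// Per-scan state: created once, then reused by every iterate call.
struct ScanState {
    BatchSource source;
    Batch batch;
    std::size_t pos;  // index of the next unread row in `batch`
};

// Plays the role of BeginForeignScan(): set up the reader and state.
ScanState* begin_scan(long total_rows, std::size_t batch_size) {
    assert(batch_size > 0);
    return new ScanState{BatchSource(total_rows, batch_size), Batch{}, 0};
}

// Plays the role of IterateForeignScan(): one row per call, false at the end.
bool iterate_scan(ScanState* s, long* out_id) {
    if (s->pos >= s->batch.ids.size()) {       // current batch exhausted
        if (!s->source.next_batch(s->batch)) return false;
        s->pos = 0;
    }
    *out_id = s->batch.ids[s->pos++];
    return true;
}

// Plays the role of EndForeignScan(): release the state.
void end_scan(ScanState* s) { delete s; }
```

In a real FDW the state would hang off the scan node and the row would be stored into a tuple slot; those details are left out here on purpose.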
