Tweet Ingestion - Insights Coding Challenge Submitted by Victor Chiapaikeo

Overview

Tweet Ingestion application designed to efficiently ingest a high volume of tweets and perform processing which includes the following:

Produce a file that groups words from tweets and counts their frequency. Example output is as follows:

	analytics  		    		1
	bigdata 					3
	kdn 						1
	smb 						1

Produce a running median of unique word counts from tweets. Example output is as follows:

	11.0
	12.5
	14.0

Setup

This code is portable across the following OS's: Linux distributions, Mac and Windows OS's. Scripts were written using Python 2.7 and have not been tested for portability to Python 3.X.

You are encouraged to use a python virtual environment using virtualenv and pip. NOTE (2015-07-18): As of now, the requirements file is empty because no modules outside the default build are used.

$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements

Description of modules imported and application

os - operating system interface related and used to port-ably join paths, remove files, etc.
sys - interpreter-related and used to parse parameters from the commandline
heapq - priority queue algorithm used to efficiently obtain root from an array
unittest - framework used to create and employ unit tests

Check out and run

Applications can be run separately or together from a shell script.

To run separately:

Both words_tweeted.py and median_unique.py accepts the same two parameters:

Input file: this is the file containing tweets separated by newlines and located in
Output file: this is the file that is produced and located

$ git clone https://github.com/vchiapaikeo/tweet_ingestion.git
$ cd tweet_ingestion
$ python src/words_tweeted.py tweet_input/tweets.txt tweet_output/ft1.txt
$ python src/median_unique.py tweet_input/tweets.txt tweet_output/ft1.txt

To run from shell script:

This scenario is simpler and will execute both scripts back to back.

$ ./run.sh

Test

Unit tests have been created to test functions within the app. A test should be executed on the commandline at the top-level dir. Two tests are available corresponding to each of module in src and can be executed as follows:

$ cd tweet_ingestion
$ python -m src.tests.unit_words_tweeted

$ cd tweet_ingestion
$ python -m src.tests.unit_median_unique

The following output should return:

..
----------------------------------------------------------------------
Ran 2 tests in 0.003s

OK

Next Steps (how you can help!)

It goes without saying that this script is a work in progress. A number of items could still be added to increase functionality, performance, and robustness of this script. A few of my favorite wish-list items are listed.

Reduce memory footprint in src/median_unique.py - the current implementation (submitted 2015-07-19) retains lists of streaming unique word counts per line in memory...
Support multiple files in input dir using glob? Could this be something we want?
Add more tests!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Tweet Ingestion - Insights Coding Challenge Submitted by Victor Chiapaikeo

Overview

Setup

Description of modules imported and application

Check out and run

Test

Next Steps (how you can help!)

Files

README.md

Latest commit

History

README.md

File metadata and controls

Tweet Ingestion - Insights Coding Challenge Submitted by Victor Chiapaikeo

Overview

Setup

Description of modules imported and application

Check out and run

Test

Next Steps (how you can help!)