A standard dataset is just a text file, with lines, where each line is a record, the fields of which are separated by a separator (eg. tabs, comma, pipe, etc). After registering a dataset and declaring the desired data analysis methods that should get executed, the system produces a 100% automatic statistical profile of the dataset and generates reports of the findings.
- Install Intellij IDEA (Community edition is free)
- Import the project as a Maven project, and it runs out of the box.
- Install Eclipse
- Import the project as a Maven project.
The project uses a Maven wrapper so there is no need to install it to your system as long as you have the JAVA_HOME environmental variable pointing to your Java 8 installation folder.
Navigate to the root folder of the repo and run,
./mvnw clean install
and wait for the procedure to finish
After that, there should be a folder called target
that includes two jar files:
Pythia-x.y.z-all-deps.jar and Pythia-x.y.z.jar
The difference is that the all deps jar file is an uber jar so you can import Pythia to a project and run it out of the box. (All dependecies are embedded into the all deps jar)
- Otherwise you will need to provide the Pythia dependencies to your pom.xml file.
To run with the driver Main method, navigate to the root folder of the repo:
java -jar target/Pythia-x.y.z-all-deps.jar
Navigate to the root folder of the repo and run,
./mvnw test
This project complies with Google's Java coding style and is formatted using the official Google java formatter. You can follow the installation guide in the official GitHub repo to install it to your Editor.
Note: Consider installing it and run it so that the project follows a coding style
In case you want to format all java files from the command line, run in the root folder of the project:
java -jar google-java-format-x.y.z-all-deps.jar -i $(find . -type f -name "*.java")
Note: The formatter needs a Java 11 installation to run in the command line
Suppose we want to generate a statistical profile of the following file:
name | age | money |
---|---|---|
Michael | 25 | 20 |
John | 21 | 15 |
Andy | 30 | 1000 |
Justin | 65 | 10000 |
Below is a sample Main class that showcases API usage in simple steps for the above dataset:
public class Main {
public static void main(String[] args) throws AnalysisException, IOException {
// 1. Initialize a DatasetProfiler object (this is the main engine interface of Pythia).
IDatasetProfiler datasetProfiler = new IDatasetProfilerFactory().createDatasetProfiler();
// 2. Specify the schema, an alias and the path of the input dataset.
StructType schema =
new StructType(
new StructField[]{
new StructField("name", DataTypes.StringType, true, Metadata.empty()),
new StructField("age", DataTypes.IntegerType, true, Metadata.empty()),
new StructField("money", DataTypes.IntegerType, true, Metadata.empty()),
});
String alias = "people";
String path = String.format(
"src%stest%sresources%sdatasets%speople.json",
File.separator, File.separator, File.separator, File.separator);
// 3. Register the input dataset specified in step 2 into Pythia.
datasetProfiler.registerDataset(alias, path, schema);
// 4. Specify labeling rules for a column and a name for the new labeled column.
List<Rule> rules =
new ArrayList<Rule>(
Arrays.asList(
new Rule("money", LabelingSystemConstants.LEQ, 20, "poor"),
new Rule("money", LabelingSystemConstants.LEQ, 1000, "mid"),
new Rule("money", LabelingSystemConstants.GT, 1000, "rich")));
String newColumnName = "money_labeled";
// 5. Create a RuleSet object and compute the new labeled column
// (steps 4 & 5 can be repeated multiple times).
RuleSet ruleSet = new RuleSet(newColumnName, rules);
datasetProfiler.computeLabeledColumn(ruleSet);
// 6. Specify the DominanceColumnSelectionMode and (optionally) a list of
// measurement & coordinate columns used in dominance pattern identification.
DominanceColumnSelectionMode mode = DominanceColumnSelectionMode.USER_SPECIFIED_ONLY;
String[] measurementColumns = new String[] { "money", "age" };
String[] coordinateColumns = new String[] { "name" };
// 7. Declare the specified dominance parameters into Pythia
// (steps 6 & 7 are optional, however, they are a prerequisite for highlight patterns identification).
datasetProfiler.declareDominanceParameters(mode, measurementColumns, coordinateColumns);
// 8. Specify the auxiliary data output directory and the desired parts of the analysis procedure
// that should get executed for the computation of the dataset profile.
String auxiliaryDataOutputDirectory = "results";
boolean shouldRunDescriptiveStats = true;
boolean shouldRunHistograms = true;
boolean shouldRunAllPairsCorrelations = true;
boolean shouldRunDecisionTrees = true;
boolean shouldRunHighlightPatterns = true;
// 9. Create a DatasetProfilerParameters object with the parameters specified in step 8
// and compute the profile of the dataset (this will take a while for big datasets).
DatasetProfilerParameters parameters = new DatasetProfilerParameters(
auxiliaryDataOutputDirectory,
shouldRunDescriptiveStats,
shouldRunHistograms,
shouldRunAllPairsCorrelations,
shouldRunDecisionTrees,
shouldRunHighlightPatterns);
datasetProfiler.computeProfileOfDataset(parameters);
// 10. (Optionally) specify an output directory path for the generated reports
// (unspecified output directory path means that the reports will be generated under the
// auxiliary data output directory specified in step 8).
String outputDirectoryPath = "";
// 11. Generate a report in plain text and markdown format.
datasetProfiler.generateReport(ReportGeneratorConstants.TXT_REPORT, outputDirectoryPath);
datasetProfiler.generateReport(ReportGeneratorConstants.MD_REPORT, outputDirectoryPath);
}
}
- Alexiou Alexandros
- Charisis Alexandros (Youtube)
- Christodoulos Antoniou
- Dimos Gkitsakis
- George Karathanos
- Panos Vassiliadis