bencomp/oldumpscripts
# Open Library dump scripts

A collection of scripts to extract statistics from Open Library dump files.

## Stats.py: produce statistics of a dump file

`stats.py` reads standard input line by line. It expects each line to be a complete JSON record, so before feeding it a dump file, remove everything that precedes the JSON record on each line. For example:

    sed -nre "s/^[^{]*//p" <ol_dump_file> | python stats.py output.json

During execution, it keeps the statistics in a dict. Each type found in the dump, except the ones with confused identities, gets a key in this dict. The values for these keys are themselves dicts, with the following keys:

- `countr` - an int count of records (of this type),
- `keys` - a dict with the keys found in the records as keys and a list as value. The list contains the number of records each key is found in, followed by the number of values: if the key has a list value in a record, the lengths of all such lists are accumulated; otherwise this is the same as the number of records,
- `identifiers` - a dict with the keys found in the `identifiers` object as keys and, as value, the number of records and the number of instances of each key,
- `si` - a dict with identifiers found in the record object as keys and, as value, a list of the number of records and the number of instances of each key,
- `classifications` - same as for `identifiers`, but for classifications,
- `sc` - same as for `si`, but for classifications.

Keys and types of records with confused identities are kept in a list under the key `confused`. If an exception is caught while processing a record, a 2-tuple containing the complete record and the exception message is appended to the list under the key `error`.

## Exportcsv.py: export data from JSON stats file to separate CSV files

Expects a file generated by `stats.py`.

## Countformats.py: count the values in the physical_format field

Expects Edition JSON records and outputs a tab-separated UTF-8 file.
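To make the `countr` and `keys` bookkeeping described for `stats.py` more concrete, here is a minimal sketch of how such statistics could be accumulated. It is not the code from `stats.py`: the function names (`strip_prefix`, `update_stats`, `collect`) are hypothetical, the sample records are made up, and it assumes that each record stores its type as `{"type": {"key": "/type/..."}}`. The `strip_prefix` helper approximates the `sed` step by dropping everything before the first `{` on a line.

```python
import json
from collections import defaultdict


def strip_prefix(line):
    """Drop everything before the first '{' (rough equivalent of the sed step)."""
    start = line.find("{")
    return line[start:] if start != -1 else ""


def update_stats(stats, record):
    """Fold one parsed record into the running per-type statistics."""
    # Assumption: the record type is stored as {"type": {"key": "/type/edition"}}.
    type_key = record["type"]["key"]
    entry = stats.setdefault(
        type_key, {"countr": 0, "keys": defaultdict(lambda: [0, 0])}
    )
    entry["countr"] += 1
    for key, value in record.items():
        counts = entry["keys"][key]
        counts[0] += 1  # number of records that contain this key
        # Number of values: list lengths are accumulated, anything else counts as 1.
        counts[1] += len(value) if isinstance(value, list) else 1
    return stats


def collect(lines):
    """Build the statistics dict from an iterable of dump lines."""
    stats = {}
    for line in lines:
        payload = strip_prefix(line.strip())
        if payload:
            update_stats(stats, json.loads(payload))
    return stats


# Made-up sample lines; the first one carries a tab-separated prefix.
sample = [
    '/type/edition\t/books/OL1M\t1\t{"type": {"key": "/type/edition"}, '
    '"title": "A", "authors": [{"key": "/authors/OL1A"}, {"key": "/authors/OL2A"}]}',
    '{"type": {"key": "/type/edition"}, "title": "B"}',
    '{"type": {"key": "/type/work"}, "title": "C"}',
]
result = collect(sample)
print(result["/type/edition"]["countr"])
print(result["/type/edition"]["keys"]["authors"])
```

In this sketch, `authors` appears in one Edition record with two list entries, so its list holds a record count of 1 and a value count of 2, while `title` appears in both Edition records with a scalar value each time. The handling of `identifiers`, `classifications`, confused identities, and errors is left out.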