Skip to content

Chunking of larger tables in DB

Nikhil VJ edited this page Sep 6, 2018 · 2 revisions

Documenting a major development in this project that took a long time to fine-tune and got finalized with v2.0.0 in Sep 2018, but was in backburner development since May 2018.
Will share documentation regarding it here.

Basic problem statement

The stop_times.txt file (and shapes.txt also in some cases) can be HUGE, million+ rows Reading and writing these took extremely long time, and in many cases, simply errored out with a memory error, in the older json-based DB system. And the json files created there were also huge. This is given in detail in #4 .

Change in DB

This prompted a transition from json-based db to HDF5 format which took care of a lot of things at the DB storage and retrieval level. But the problem of working with a very large table in memory remained. v.long waiting time and constant threat of out-of-memory errors.

Chunking method

So from there, an idea for CHUNKING emerged.

  • tables like stop_times.txt and shapes.txt are anyways never actually handled in full. ie, at no point in managing GTFS are you processing the entire table. You're just working with one trip_id's rows in stop_times, and one shape_id's rows in shapes.
  • So why not store them in chunks
  • Split the stop_times file into multiple chunks: stop_times_1.h5, stop_times_2.h5, ... having a million or half a million rows each.
  • make a lookup json having the trip_id's mapped to the file containing them.

Note: from this point onwards, stop_times and trip_id will be mentioned but that should be assumed to apply to any table that's being chunked, and the corresponding primary key.

Then, when reading the table,

  • load the lookup json and figure out which chunk our trip_id is mapped to.
  • load that chunk only, leave the remaining data alone.

Chunking the tables at import

When creating the data for the first time during a feed import, this chunking feature was leveraged : the csv itself was read in chunks, and each chunk is processed separately and deleted from memory before loading the next chunk. This way huge feeds having huge stop_times.txt files can be read by the program without requiring large memory hardware.

Contiguous Chunks

  • one challenge : how to ensure that one trip_id doesn't spill over from one chunk to the other.
  • See the code for this in the GTFS import function:

https://github.com/WRI-Cities/static-GTFS-manager/blob/8c9a8f30e482a3dc1e4462ebba85beb9c180ce86/GTFSserverfunctions.py#L168

Sample console output of an import:

Storing stop_times in chunks.
stop_times_1.h5: 499987 rows
stop_times_2.h5: 500004 rows
stop_times_3.h5: 499994 rows
stop_times_4.h5: 500007 rows
stop_times_5.h5: 499993 rows
stop_times_6.h5: 500007 rows
stop_times_7.h5: 500004 rows
stop_times_8.h5: 499970 rows
stop_times_9.h5: 500013 rows
stop_times_10.h5: 500020 rows
stop_times_11.h5: 4333 rows
Appending the 22 rows of last ID to last chunk stop_times_11.h5
Lookup json: stop_times_lookup.json created for mapping ID trip_id to stop_times_n.h5 chunk files.

An explanation of the workflow:

  • from the csv chunk loaded, take all trip_ids except the last one (which may be continued in the next chunk). Store these into a .h5 file.
  • store the last trip_id's rows in a "CarryOver" df.
  • when the next csv chunk is loaded, concatenate the CarryOver rows with these. Even here, repeat the earlier process.
  • this will result in chunks that have slightly less or more rows than the chosen chunk size.
  • After processing the last chunk, append the last set of CarryOver rows back to the last chunk.

(this is a lot easier to code than it is to explain, so apologies if this is still confusing)