Splitting GTFS transit data into smaller feeds

I sometimes have a very large GTFS zip file, valid for a period of 6 months, and loading that much data is not economical on a low-resource EC2 server (for example, 2 GB of memory and a 10 GB hard disk).

I would like to split this large GTFS feed into 3 smaller GTFS zip files, each holding 2 months (6 months / 3 files) of valid data. Of course, that means I would need to replace the data every 2 months.

I have found a Python program that achieves the opposite goal, merging feeds: https://github.com/google/transitfeed/blob/master/merge.py (this is a very good Python project, by the way).

I would be very thankful for any pointers.

Best regards,

Dunn.

There are 2 answers

1
Brian Ferris (best answer)

It's worth noting that entries in stop_times.txt are usually the biggest memory hog when it comes to loading a GTFS feed. Since most systems store each trip and its stop_times once, rather than replicating them for every date on which the trip is active, shortening the service calendar probably won't save you much.
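One way to check this for a particular feed is to count the data rows in each file of the zip before deciding how to split it. The following is only a minimal sketch: the function name count_gtfs_rows is made up for illustration, and it assumes the feed's .txt files are UTF-8 encoded CSV.

```python
import csv
import io
import zipfile


def count_gtfs_rows(feed):
    """Return {member name: number of data rows} for each .txt file in a GTFS zip.

    `feed` may be a path or a binary file-like object. Rows are streamed,
    so even a large stop_times.txt is never held in memory all at once.
    """
    counts = {}
    with zipfile.ZipFile(feed) as zf:
        for name in zf.namelist():
            if not name.endswith(".txt"):
                continue
            with zf.open(name) as raw:
                # utf-8-sig drops the byte-order mark that some feeds include.
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                reader = csv.reader(text)
                next(reader, None)  # skip the header row
                # Blank lines come back as empty lists; do not count them.
                counts[name] = sum(1 for row in reader if row)
    return counts
```

Calling count_gtfs_rows on the path to a feed and sorting the result by row count shows which files dominate. If stop_times.txt dwarfs calendar.txt and calendar_dates.txt, trimming dates alone is unlikely to shrink the feed by much.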

That said, there are some tools for slicing and dicing GTFS. Check out the OneBusAway GTFS Transformer tool, for example:

http://developer.onebusaway.org/modules/onebusaway-gtfs-modules/1.3.3/onebusaway-gtfs-transformer-cli.html

0
Drew Dara-Abrams

Another, more recent option for processing large GTFS files is transitland-lib. It's written in the Go programming language, which is quite efficient at parsing huge GTFS feeds.

See the transitland extract command, which can take a number of arguments to cut an existing GTFS feed down to a smaller size:

% transitland extract --help
Usage: extract <input> <output>
  -allow-entity-errors
        Allow entities with errors to be copied
  -allow-reference-errors
        Allow entities with reference errors to be copied
  -create
        Create a basic database schema if none exists
  -create-missing-shapes
        Create missing Shapes from Trip stop-to-stop geometries
  -ext value
        Include GTFS Extension
  -extract-agency value
        Extract Agency
  -extract-calendar value
        Extract Calendar
  -extract-route value
        Extract Route
  -extract-route-type value
        Extract Routes matching route_type
  -extract-stop value
        Extract Stop
  -extract-trip value
        Extract Trip
  -fvid int
        Specify FeedVersionID when writing to a database
  -interpolate-stop-times
        Interpolate missing StopTime arrival/departure values
  -normalize-service-ids
        Create Calendar entities for CalendarDate service_id's
  -set value
        Set values on output; format is filename,id,key,value
  -use-basic-route-types
        Collapse extended route_type's into basic GTFS values
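
To connect this to the original question, here is a rough sketch of how a 2-month slice might be prepared: pick the service_ids in calendar.txt whose date range overlaps the window, then pass them to the -extract-calendar flag listed above. The helper names are hypothetical, and the sketch assumes that -extract-calendar can be given more than once (one service_id per occurrence), which the help text does not state. It also ignores services defined only in calendar_dates.txt, and it only helps when the feed uses separate service_ids for different periods.

```python
from datetime import date, datetime


def parse_gtfs_date(value):
    """Parse a GTFS date string in YYYYMMDD form."""
    return datetime.strptime(value, "%Y%m%d").date()


def service_ids_in_window(calendar_rows, window_start, window_end):
    """Return service_ids whose [start_date, end_date] overlaps the window.

    `calendar_rows` is any iterable of dicts with the calendar.txt columns
    service_id, start_date and end_date (for example a csv.DictReader).
    """
    selected = []
    for row in calendar_rows:
        start = parse_gtfs_date(row["start_date"])
        end = parse_gtfs_date(row["end_date"])
        # Two closed intervals overlap when each starts before the other ends.
        if start <= window_end and end >= window_start:
            selected.append(row["service_id"])
    return selected


def build_extract_command(input_feed, output_feed, service_ids):
    """Build an argument list for `transitland extract`.

    Assumption: -extract-calendar may be repeated, once per service_id.
    Flags are placed before the positional <input> <output> arguments.
    """
    argv = ["transitland", "extract"]
    for service_id in service_ids:
        argv += ["-extract-calendar", service_id]
    argv += [input_feed, output_feed]
    return argv
```

The resulting list can be handed to subprocess.run. The extracted services keep their original start and end dates, so this reduces the feed only by dropping services, trips and stop_times that fall entirely outside the window.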