OpenBus - Getting Started
This is an intro / tutorial notebook for getting started with GTFS data.
It is based on work shown on my dev blog, simplistic.me. The posts there have not been updated for current versions of my code or of other packages. Still, do check it out, and feel free to comment or open an issue or pull request.
This is only one basic workflow, which I happen to like. You can find many more examples for working with GTFS online. There are tools and packages in many languages.
THE place to start is awesome-transit.
TODO: Add more tools to this notebook - peartree, GTFSTK, UrbanAccess, Pandana
Installation
TODO: Go to open-bus README (this needs to be there)
- Install Anaconda3
- Create a virtual environment (call it openbus or something indicative):
conda create -n openbus
- Clone open-bus-explore:
git clone https://github.com/cjer/open-bus-explore
- Install everything in requirements.txt:
  - partridge, peartree and GTFSTK require
    pip install
  - the rest you should install using
    conda install -c conda-forge <package_name>
- Run JupyterLab or Jupyter Notebook
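Once the environment is set up, a quick sanity check can tell you whether anything is still missing. This is only a minimal sketch: the list of names is taken from the import cell of this notebook, and `missing_packages` is a hypothetical helper, not something from gtfs_utils.

```python
import importlib.util

# Import names used in the "Imports and config" cell of this notebook.
REQUIRED = ["pandas", "partridge", "matplotlib", "seaborn", "altair"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Anything reported as missing can then be installed with pip or conda as described in the list above.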
Imports and config
# Put these at the top of every notebook, to get automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import pandas as pd
import partridge as ptg
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
import datetime
from gtfs_utils import *
alt.renderers.enable('notebook')
alt.data_transformers.enable('json')
sns.set_style("white")
sns.set_context("talk")
sns.set_palette('Set2', 10)
Getting the data
Three main options:
- Tap into the source: We have some functions for working with MoT's FTP.
- Our archive on Amazon S3
- TransitFeeds' Archive
You can go to our redash interface if you prefer to work with SQL (can't attest to what the state of it is at any given moment).
FTP
get_ftp_dir()
The expected file names are stored as constants in gtfs_utils:
GTFS_FILE_NAME
TARIFF_FILE_NAME
LOCAL_GTFS_ZIP_PATH = 'data/sample/latest_gtfs.zip'
LOCAL_TARIFF_PATH = 'data/sample/latest_tariff.zip'
get_ftp_file(file_name = GTFS_FILE_NAME, local_path = LOCAL_GTFS_ZIP_PATH)
By default we don't override local files, but we can by adding force=True:
get_ftp_file(file_name=GTFS_FILE_NAME,
             local_path=LOCAL_GTFS_ZIP_PATH,
             force=True)
get_ftp_file(file_name=TARIFF_FILE_NAME,
             local_path=LOCAL_TARIFF_PATH,
             force=True)
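To make the force behaviour concrete, here is a rough sketch of the "skip if the file is already there" pattern. It is not the actual code of get_ftp_file; `fetch_unless_present` and `fake_download` are hypothetical names, and the downloader is a stand-in that just writes a small file.

```python
import os
import tempfile


def fetch_unless_present(local_path, download, force=False):
    """Call download(local_path) only if the file is missing or force is True.

    Returns True if a download happened, False if the existing file was kept.
    `download` is any callable that writes the file to local_path.
    """
    if os.path.exists(local_path) and not force:
        return False
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    download(local_path)
    return True


# Stand-in downloader that records each call (an assumption for this demo).
calls = []


def fake_download(path):
    calls.append(path)
    with open(path, "w") as fh:
        fh.write("data")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "sample", "latest_gtfs.zip")
    first = fetch_unless_present(target, fake_download)              # file missing: downloads
    second = fetch_unless_present(target, fake_download)             # file present: skipped
    third = fetch_unless_present(target, fake_download, force=True)  # forced: downloads again
```

Keeping the default at "don't overwrite" means re-running the notebook does not hit the FTP server again unless you ask for it.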
S3 archive
Another option is to get the files from our archive on S3. For now it does require credentials.
#!aws s3 cp s3://s3.obus.hasadna.org.il/2018-02-01.zip data/gtfs_feeds/2018-02-01.zip
TransitFeeds
You can always turn to the great TransitFeeds archive and search for the feed you want (MoT's feed is archived there about every two weeks).
Creating a partridge feed
We have a util function for getting a partridge feed object by date.
feed = get_partridge_feed_by_date(LOCAL_GTFS_ZIP_PATH, datetime.date.today())
type(feed)
- Another option would be to use ptg.get_representative_feed(), which finds the busiest day in the GTFS file and returns a feed for that day. It is not shown here.
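To give a feel for what "busiest day" means, here is a toy sketch that counts scheduled trips per calendar date. It is not partridge's implementation (which I have not reproduced here, and which may also account for calendar_dates.txt exceptions); `busiest_date` and the sample rows are made up for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def busiest_date(calendar_rows, trips_per_service):
    """Return (date, trip_count) for the day with the most scheduled trips.

    calendar_rows: dicts with service_id, start_date, end_date (date objects)
                   and weekdays, a set of ints where Monday is 0.
    trips_per_service: mapping of service_id -> number of trips.
    """
    totals = Counter()
    for row in calendar_rows:
        day = row["start_date"]
        while day <= row["end_date"]:
            if day.weekday() in row["weekdays"]:
                totals[day] += trips_per_service.get(row["service_id"], 0)
            day += timedelta(days=1)
    if not totals:
        raise ValueError("no service days found")
    # Highest trip count wins; ties go to the earliest date.
    return min(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up services for one week (2018-02-04 is a Sunday).
week = {"start_date": date(2018, 2, 4), "end_date": date(2018, 2, 10)}
calendar_rows = [
    {"service_id": "sun_thu", "weekdays": {6, 0, 1, 2, 3}, **week},
    {"service_id": "friday", "weekdays": {4}, **week},
    {"service_id": "extra_mon", "weekdays": {0}, **week},
]
trips_per_service = {"sun_thu": 100, "friday": 40, "extra_mon": 5}

best_day, best_count = busiest_date(calendar_rows, trips_per_service)
```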
The feed contains all the (standard) files from the original GTFS zip, as pandas DataFrames.
[x for x in dir(feed) if not x.startswith('_')]
Figuring out geographical zones requires another zip file from MoT's FTP. get_zones_df() reads it and returns a simple mapping stop_code -> (Hebrew) zone_name, also as a DataFrame.
zones = get_zones_df(LOCAL_TARIFF_PATH)
zones.head()
Tidy DataFrame
A (monstrous) merged DataFrame for fancy analysis can be built using get_tidy_feed_df(), to which you pass a partridge feed and any extra DataFrames you want to merge into it (only zones is used here).
This takes a few minutes (MoT's GTFS is big).
f = get_tidy_feed_df(feed, [zones])
and what you get is this:
f.head()
In the future I intend to make this more customizable (field selection, transformations and more).
f.shape
feed.stop_times.shape
So we truly have all the stop times for one whole day of trips.
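Why does the row count stay the same? A sketch of the idea with toy tables: if every table is left-joined onto stop_times on a key that is unique on the right side, no rows are added or dropped. This is only an assumption about how such a tidy frame can be built; the actual columns and merge order in get_tidy_feed_df() may differ, and all the data below is invented.

```python
import pandas as pd

# Toy stand-ins for the GTFS tables (invented values).
stop_times = pd.DataFrame({
    "trip_id": ["t1", "t1", "t2"],
    "stop_id": [1, 2, 1],
    "arrival_time": pd.to_timedelta(["06:00:00", "06:10:00", "25:05:00"]),
})
trips = pd.DataFrame({"trip_id": ["t1", "t2"], "route_id": ["r1", "r2"]})
routes = pd.DataFrame({"route_id": ["r1", "r2"], "route_short_name": ["5", "480"]})
stops = pd.DataFrame({"stop_id": [1, 2], "stop_code": [100, 200]})
zones = pd.DataFrame({"stop_code": [100, 200], "zone_name": ["Zone A", "Zone B"]})

# Left-join everything onto stop_times; validate= raises an error if a key
# is duplicated on the right side, which would otherwise multiply rows.
tidy = (stop_times
        .merge(trips, on="trip_id", how="left", validate="many_to_one")
        .merge(routes, on="route_id", how="left", validate="many_to_one")
        .merge(stops, on="stop_id", how="left", validate="many_to_one")
        .merge(zones, on="stop_code", how="left", validate="many_to_one"))

print(tidy.shape, stop_times.shape)
```

The merged frame has one row per stop time, with the trip, route, stop and zone columns attached to it.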
Random examples
f.set_index('arrival_time').resample('10min').size().plot()
zone_counts = (f.set_index('arrival_time')
               .groupby([pd.Grouper(freq='10min'), 'zone_name'])
               .size().reset_index()
               .rename(columns={0: 'trips'})
               # arrival_time is assumed to be a timedelta; anchor it to today's date for plotting
               .assign(time = lambda x: pd.Timestamp(datetime.date.today()) + x.arrival_time)
               )
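The binning above can be tried on a tiny frame first. The sketch below assumes arrival times are stored as timedeltas measured from the start of the service day (GTFS allows times past 24:00:00 for trips that run after midnight); the data and the service date are invented.

```python
import datetime

import pandas as pd

# Invented arrivals; "24:03:00" is three minutes after midnight of the next day.
arrivals = pd.DataFrame({
    "arrival_time": pd.to_timedelta(["06:01:00", "06:09:00", "06:12:00", "24:03:00"]),
    "zone_name": ["Zone A", "Zone A", "Zone A", "Zone B"],
})

# Floor each arrival to its 10-minute bin, then count rows per bin and zone.
counts = (arrivals
          .assign(bin=arrivals["arrival_time"].dt.floor("10min"))
          .groupby(["bin", "zone_name"])
          .size()
          .reset_index(name="trips"))

# Anchor the timedelta bins to a concrete date so a chart can use a time axis.
service_day = pd.Timestamp(datetime.date(2018, 2, 1))
counts["time"] = service_day + counts["bin"]

print(counts)
```

Note how the after-midnight arrival lands on the following calendar day once the date is added.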
lines = alt.Chart(zone_counts).mark_line().encode(
    x = alt.X('time:T', axis=alt.Axis(format='%H:%M')),
    y = 'trips',
    color = alt.Color('zone_name', legend=None),
    tooltip = 'zone_name'
).interactive()
annotation = alt.Chart(zone_counts).mark_text(
    align='left',
    baseline='middle',
    fontSize = 12,
    dx = 4
).encode(
    x='time:T',
    y='trips',
    text='zone_name'
).transform_filter(
    ((alt.datum.trips == 8785) | (alt.datum.trips == 5347) | (alt.datum.trips == 1985))
)
(lines + annotation).properties(height=400, width=600)