GTFS Stats
Using a mix of partridge and gtfstk with some of my own additions to create daily statistical DataFrames for trips, routes and stops. This will later become a module that we will run on our historical MoT GTFS archive and schedule for nightly runs.
Imports and config¶
In [1]:
# Put these at the top of every notebook, to get automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline
In [92]:
import pandas as pd
import numpy as np
import partridge as ptg
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
import datetime
from gtfs_utils import *
alt.renderers.enable('notebook')
alt.data_transformers.enable('json')
sns.set_style("white")
sns.set_context("talk")
sns.set_palette('Set2', 10)
In [5]:
LOCAL_GTFS_ZIP_PATH = 'data/gtfs_feeds/2018-03-05.zip'
LOCAL_TARIFF_PATH = 'data/sample/latest_tariff.zip'
Creating a partridge
feed¶
We have a util function for getting a partridge
feed object by date.
In [6]:
feed = get_partridge_feed_by_date(LOCAL_GTFS_ZIP_PATH, datetime.date(2018,3,5))
type(feed)
Out[6]:
In [7]:
zones = get_zones_df(LOCAL_TARIFF_PATH)
zones.head()
Out[7]:
Stats¶
trip stats¶
- calculate using the partridge feed
- add:
  - stop_code, stop_name
  - zone
In [8]:
import gtfstk
In [10]:
feed.stop_times.dtypes
Out[10]:
Computing trip_stats, corresponding to gtfstk.compute_trip_stats().
In [132]:
feed.trips.direction_id.value_counts()
Out[132]:
Check whether we have bidirectional routes (routes with trips in both directions).
In [133]:
feed.trips.groupby('route_id').direction_id.nunique().value_counts()
Out[133]:
Nope
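The same check in miniature, on a toy trips table (made-up IDs, not feed data): `nunique` counts distinct `direction_id` values per route, and `value_counts` then tallies how many routes have one vs. two directions.

```python
import pandas as pd

# Toy trips table (not from the real feed) to illustrate the check
trips = pd.DataFrame({
    'route_id':     ['r1', 'r1', 'r2', 'r2', 'r3'],
    'direction_id': [0,    0,    0,    1,    0],
})

# Number of distinct directions per route, tallied across routes
counts = trips.groupby('route_id').direction_id.nunique().value_counts()
print(counts)  # r2 is bidirectional (2 directions); r1 and r3 are not
```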
In [55]:
f = feed.trips
f = (
    f[['route_id', 'trip_id', 'direction_id', 'shape_id']]
    .merge(feed.routes[['route_id', 'route_short_name', 'route_type']])
    .merge(feed.stop_times)
    .merge(feed.stops[['stop_id', 'stop_name', 'stop_lat', 'stop_lon', 'stop_code']])
    .merge(zones)
    .sort_values(['trip_id', 'stop_sequence'])
    # no need to convert departure_time - partridge already parses it to seconds
)
In [58]:
f.head().T
Out[58]:
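pandas merges on the intersection of column names by default, which is what the chain above relies on. A minimal sketch with toy tables (hypothetical IDs, not feed data):

```python
import pandas as pd

trips = pd.DataFrame({'trip_id': ['t1', 't2'], 'route_id': ['r1', 'r1']})
routes = pd.DataFrame({'route_id': ['r1'], 'route_short_name': ['18']})
stop_times = pd.DataFrame({
    'trip_id': ['t1', 't1', 't2'],
    'stop_id': ['s1', 's2', 's1'],
    'stop_sequence': [1, 2, 1],
})

# Each merge joins on the shared key column(s): route_id, then trip_id
f = (trips
     .merge(routes)       # adds route_short_name, joined on route_id
     .merge(stop_times)   # expands to one row per stop call, joined on trip_id
     .sort_values(['trip_id', 'stop_sequence']))
print(f.shape)  # (3, 5)
```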
In [68]:
geometry_by_stop = gtfstk.build_geometry_by_stop(feed, use_utm=True)
In [69]:
g = f.groupby('trip_id')
In [70]:
from collections import OrderedDict
def my_agg(group):
    # Aggregate each trip's (sorted) stop_times into a single stats row
    d = OrderedDict()
    d['route_id'] = group['route_id'].iat[0]
    d['route_short_name'] = group['route_short_name'].iat[0]
    d['route_type'] = group['route_type'].iat[0]
    d['direction_id'] = group['direction_id'].iat[0]
    d['shape_id'] = group['shape_id'].iat[0]
    d['num_stops'] = group.shape[0]
    d['start_time'] = group['departure_time'].iat[0]
    d['end_time'] = group['departure_time'].iat[-1]
    d['start_stop_id'] = group['stop_id'].iat[0]
    d['end_stop_id'] = group['stop_id'].iat[-1]
    d['start_stop_code'] = group['stop_code'].iat[0]
    d['end_stop_code'] = group['stop_code'].iat[-1]
    d['start_stop_name'] = group['stop_name'].iat[0]
    d['end_stop_name'] = group['stop_name'].iat[-1]
    d['start_zone'] = group['zone_name'].iat[0]
    d['end_zone'] = group['zone_name'].iat[-1]
    # A trip is a loop if its first and last stops are within 400 m
    dist = geometry_by_stop[d['start_stop_id']].distance(
        geometry_by_stop[d['end_stop_id']])
    d['is_loop'] = int(dist < 400)
    # times are seconds-past-midnight, so this gives duration in hours
    d['duration'] = (d['end_time'] - d['start_time']) / 3600
    return pd.Series(d)
In [71]:
h = g.apply(my_agg)
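The aggregation pattern here - `groupby().apply()` with a function that returns a `pd.Series` - yields one stats row per trip, with the OrderedDict keys becoming columns. In miniature, on toy data:

```python
import pandas as pd
from collections import OrderedDict

# Toy frame: two trips, one row per stop call
df = pd.DataFrame({'trip_id': ['t1', 't1', 't2'],
                   'stop_id': ['s1', 's2', 's3']})

def agg(group):
    # Returning a Series makes each OrderedDict key a column in the result
    d = OrderedDict()
    d['num_stops'] = group.shape[0]
    d['start_stop_id'] = group['stop_id'].iat[0]
    return pd.Series(d)

out = df.groupby('trip_id').apply(agg)
print(out)  # one row per trip_id, columns num_stops and start_stop_id
```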
In [72]:
h['distance'] = g.shape_dist_traveled.max()
In [73]:
# Reset index and compute final stats
h = h.reset_index()
h['speed'] = h['distance'] / h['duration'] / 1000  # km/h, assuming distance is in meters
h[['start_time', 'end_time']] = h[['start_time', 'end_time']].applymap(
    lambda x: gtfstk.helpers.timestr_to_seconds(x, inverse=True))
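`timestr_to_seconds` with `inverse=True` turns seconds-past-midnight back into an HH:MM:SS string. The conversion is easy to sketch in plain Python, remembering that GTFS allows hours past 24 for after-midnight service (a hand-rolled stand-in, not gtfstk's actual code):

```python
def seconds_to_timestr(seconds):
    # GTFS times may exceed 24:00:00 for trips running past midnight,
    # so don't wrap the hours around
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f'{hours:02d}:{minutes:02d}:{secs:02d}'

def timestr_to_seconds(timestr):
    hours, minutes, secs = (int(x) for x in timestr.split(':'))
    return hours * 3600 + minutes * 60 + secs

print(seconds_to_timestr(90000))       # '25:00:00' - 1 a.m. the next day
print(timestr_to_seconds('07:30:00'))  # 27000
```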
In [74]:
h.sort_values(by='speed', ascending=False).head().T
Out[74]:
In [76]:
h[h.is_loop==1].sort_values(by='duration', ascending=False).head().T
Out[76]:
Route stats¶
This is mostly taken from gtfstk.compute_route_stats_base(), with some additions:
- start_stop_id and end_stop_id of the route's first trip
- start_zone and end_zone of the route's first trip
- num_stops of the route's first trip
In [ ]:
"""
Compute stats for the given subset of trips stats.
Parameters
----------
trip_stats_subset : DataFrame
    Subset of the output of :func:`.trips.compute_trip_stats`
split_directions : boolean
    If ``True``, then separate the stats by trip direction (0 or 1);
    otherwise aggregate trips from both directions.
    Default: ``False``
headway_start_time : string
    HH:MM:SS time string indicating the start time for computing
    headway stats.
    Default: ``'07:00:00'``
headway_end_time : string
    HH:MM:SS time string indicating the end time for computing
    headway stats.
    Default: ``'19:00:00'``
Returns
-------
DataFrame
Columns are
- ``'route_id'``
- ``'route_short_name'``
- ``'agency_id'``
- ``'agency_name'``
- ``'route_long_name'``
- ``'route_type'``
- ``'direction_id'``: 1/0
- ``'num_trips'``: number of trips on the route in the subset
- ``'num_trip_starts'``: number of trips on the route with
nonnull start times
- ``'num_trip_ends'``: number of trips on the route with nonnull
end times that end before 23:59:59
- ``'is_loop'``: 1 if at least one of the trips on the route has
its ``is_loop`` field equal to 1; 0 otherwise
- ``'is_bidirectional'``: 1 if the route has trips in both
directions; 0 otherwise
- ``'start_time'``: start time of the earliest trip on the route
- ``'end_time'``: end time of latest trip on the route
- ``'max_headway'``: maximum of the durations (in minutes)
between trip starts on the route between
``headway_start_time`` and ``headway_end_time`` on the given
dates
- ``'min_headway'``: minimum of the durations (in minutes)
mentioned above
- ``'mean_headway'``: mean of the durations (in minutes)
mentioned above
- ``'peak_num_trips'``: maximum number of simultaneous trips in
service (for the given direction, or for both directions when
``split_directions==False``)
- ``'peak_start_time'``: start time of first longest period
during which the peak number of trips occurs
- ``'peak_end_time'``: end time of first longest period during
which the peak number of trips occurs
- ``'service_duration'``: total of the duration of each trip on
the route in the given subset of trips; measured in hours
- ``'service_distance'``: total of the distance traveled by each
trip on the route in the given subset of trips; measured in
whatever distance units are present in ``trip_stats_subset``;
contains all ``np.nan`` entries if ``feed.shapes is None``
- ``'service_speed'``: service_distance/service_duration;
measured in distance units per hour
- ``'mean_trip_distance'``: service_distance/num_trips
- ``'mean_trip_duration'``: service_duration/num_trips
- ``'start_stop_id'``: ``start_stop_id`` of the first trip for the route
- ``'end_stop_id'``: ``end_stop_id`` of the first trip for the route
- ``'num_stops'``: ``num_stops`` of the first trip for the route
- ``'start_zone'``: ``start_zone`` of the first trip for the route
- ``'end_zone'``: ``end_zone`` of the first trip for the route
If not ``split_directions``, then remove the
direction_id column and compute each route's stats,
except for headways, using
its trips running in both directions.
In this case, (1) compute max headway by taking the max of the
max headways in both directions; (2) compute mean headway by
taking the weighted mean of the mean headways in both
directions.
If ``trip_stats_subset`` is empty, return an empty DataFrame.
"""
In [103]:
headway_start_time = '07:00:00'
headway_end_time = '19:00:00'

# Convert trip start and end times back to seconds to ease calculations below
f = h.copy()
f[['start_time', 'end_time']] = f[['start_time', 'end_time']].applymap(
    gtfstk.helpers.timestr_to_seconds)
headway_start = gtfstk.helpers.timestr_to_seconds(headway_start_time)
headway_end = gtfstk.helpers.timestr_to_seconds(headway_end_time)
In [114]:
def compute_route_stats(group):
    d = OrderedDict()
    d['route_short_name'] = group['route_short_name'].iat[0]
    d['route_type'] = group['route_type'].iat[0]
    d['num_trips'] = group.shape[0]
    d['num_trip_starts'] = group['start_time'].count()
    d['num_trip_ends'] = group.loc[
        group['end_time'] < 24*3600, 'end_time'].count()
    d['is_loop'] = int(group['is_loop'].any())
    d['is_bidirectional'] = int(group['direction_id'].unique().size > 1)
    d['start_time'] = group['start_time'].min()
    d['end_time'] = group['end_time'].max()

    # Compute headway stats
    headways = np.array([])
    for direction in [0, 1]:
        stimes = group[group['direction_id'] == direction][
            'start_time'].values
        stimes = sorted([stime for stime in stimes
                         if headway_start <= stime <= headway_end])
        headways = np.concatenate([headways, np.diff(stimes)])
    if headways.size:
        d['max_headway'] = np.max(headways) / 60  # minutes
        d['min_headway'] = np.min(headways) / 60  # minutes
        d['mean_headway'] = np.mean(headways) / 60  # minutes
    else:
        d['max_headway'] = np.nan
        d['min_headway'] = np.nan
        d['mean_headway'] = np.nan

    # Compute peak num trips
    times = np.unique(group[['start_time', 'end_time']].values)
    counts = [gtfstk.helpers.count_active_trips(group, t) for t in times]
    start, end = gtfstk.helpers.get_peak_indices(times, counts)
    d['peak_num_trips'] = counts[start]
    d['peak_start_time'] = times[start]
    d['peak_end_time'] = times[end]

    d['service_distance'] = group['distance'].sum()
    d['service_duration'] = group['duration'].sum()

    # Added by cjer
    d['start_stop_id'] = group['start_stop_id'].iat[0]
    d['end_stop_id'] = group['end_stop_id'].iat[0]
    d['num_stops'] = group['num_stops'].iat[0]
    d['start_zone'] = group['start_zone'].iat[0]
    d['end_zone'] = group['end_zone'].iat[0]
    return pd.Series(d)
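The peak-trips step counts, for each timestamp t, how many trips' [start_time, end_time] intervals contain t; since the count only changes at a start or end instant, checking those instants and taking the max is enough. A hand-rolled sketch of the same idea on made-up trips (not gtfstk's implementation):

```python
import numpy as np
import pandas as pd

# Made-up trips: start/end in seconds past midnight
trips = pd.DataFrame({
    'start_time': [0, 100, 200, 250],
    'end_time':   [300, 400, 260, 500],
})

def count_active(group, t):
    # Trips whose [start_time, end_time] interval contains t
    return int(((group['start_time'] <= t) & (group['end_time'] >= t)).sum())

# The active-trip count only changes at a start or end instant
times = np.unique(trips[['start_time', 'end_time']].values)
counts = [count_active(trips, t) for t in times]
peak = max(counts)
print(peak)  # 4 trips active at once (e.g. at t=250)
```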
In [115]:
g = f.groupby('route_id').apply(compute_route_stats).reset_index()

# Compute a few more stats
g['service_speed'] = g['service_distance'] / g['service_duration']
g['mean_trip_distance'] = g['service_distance'] / g['num_trips']
g['mean_trip_duration'] = g['service_duration'] / g['num_trips']
In [116]:
# Convert route times back to time strings
cols = ['start_time', 'end_time', 'peak_start_time', 'peak_end_time']
g[cols] = g[cols].applymap(
    lambda x: gtfstk.helpers.timestr_to_seconds(x, inverse=True))
In [117]:
g['service_speed'] = g.service_speed / 1000  # meters per hour -> km/h
In [118]:
g.sort_values(by='num_trips', ascending=False).head(10).T
Out[118]:
Add stuff¶
- agency_id, agency_name
- route_long_name
In [119]:
g = (g
     .merge(feed.routes[['route_id', 'route_long_name', 'agency_id']],
            how='left', on='route_id')
     .merge(feed.agency[['agency_id', 'agency_name']],
            how='left', on='agency_id')
)
In [120]:
g = g[['route_id', 'route_short_name', 'agency_id', 'agency_name',
       'route_long_name', 'route_type', 'num_trips', 'num_trip_starts',
       'num_trip_ends', 'is_loop', 'is_bidirectional', 'start_time',
       'end_time', 'max_headway', 'min_headway', 'mean_headway',
       'peak_num_trips', 'peak_start_time', 'peak_end_time',
       'service_distance', 'service_duration', 'service_speed',
       'mean_trip_distance', 'mean_trip_duration', 'start_stop_id',
       'end_stop_id', 'num_stops', 'start_zone', 'end_zone']]
In [135]:
g.sort_values(by='peak_num_trips', ascending=False).head(10).T
Out[135]:
In [138]:
g.sort_values(by='peak_num_trips', ascending=False).head(10).T.to_csv('180305_route_stats_top10_peak_num_trips.csv')
In [125]:
g.is_bidirectional.value_counts()
Out[125]:
What's next¶
TODO
- add split_directions
- time between stops - max, min, mean (using delta)
- integrate with custom day cutoff
- add day and night headways and num_trips (maybe noon also)
- put this all back into proper documented functions
- write tests
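For the time-between-stops item, a per-trip diff over the sorted stop_times would do it; a sketch on toy data (hypothetical values, seconds past midnight):

```python
import pandas as pd

# Toy stop_times: departure_time in seconds, one row per stop call
stop_times = pd.DataFrame({
    'trip_id': ['t1', 't1', 't1', 't2', 't2'],
    'stop_sequence': [1, 2, 3, 1, 2],
    'departure_time': [0, 120, 300, 0, 240],
})

# Seconds between consecutive stops within each trip (NaN at each trip's first stop)
stop_times['delta'] = (stop_times
                       .sort_values(['trip_id', 'stop_sequence'])
                       .groupby('trip_id').departure_time.diff())

stats = stop_times.groupby('trip_id').delta.agg(['min', 'max', 'mean'])
print(stats)  # per-trip min/max/mean gap between stops
```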