musa-509-fall-2021 · johnatawnclementawn · Oct 3, 2021 · Oct 3, 2021 · Oct 10, 2021 · Oct 10, 2021
diff --git a/A2_Q6_queryResults.PNG b/A2_Q6_queryResults.PNG
diff --git a/A2_Q6_queryTime.PNG b/A2_Q6_queryTime.PNG
diff --git a/A2_Q7_queryTime.PNG b/A2_Q7_queryTime.PNG
diff --git a/A2_Q7_results.PNG b/A2_Q7_results.PNG
diff --git a/README.md b/README.md
@@ -22,9 +22,14 @@
 
 ## Questions
 
-1. Which bus stop has the largest population within 800 meters? As a rough estimation, consider any block group that intersects the buffer as being part of the 800 meter buffer.
+[1. Which bus stop has the largest population within 800 meters?](query01.sql)
+As a rough estimation, consider any block group that intersects the buffer as being part of the 800 meter buffer.
 
-2. Which bus stop has the smallest population within 800 meters?
+|stop_name|Population|the_geom|
+|:---:|:---:|:---:|
+|"Passyunk Av & 15th St"|50867|"0101000020E6100000B1C398F4F7CA52C0D0807A336AF64340"|
+
+[2. Which bus stop has the smallest population within 800 meters?](query02.sql)
 
   **The queries to #1 & #2 should generate relations with a single row, with the following structure:**
 
@@ -35,8 +40,12 @@
       the_geom geometry(Point, 4326) -- The geometry of the bus stop
   )
   ```
+|stop_name|Population|the_geom|
+|:---:|:---:|:---:|
+|"Charter Rd & Norcom Rd"|2|"0101000020E6100000C896E5EB32C052C0DF3312A1110C4440"|
 
-3. Using the Philadelphia Water Department Stormwater Billing Parcels dataset, pair each parcel with its closest bus stop. The final result should give the parcel address, bus stop name, and distance apart in meters. Order by distance (largest on top).
+[3. Using the Philadelphia Water Department Stormwater Billing Parcels dataset, pair each parcel with its closest bus stop. The final result should give the parcel address, bus stop name, and distance apart in meters.](query03.sql)
+Order by distance (largest on top).
 
   **Structure:**
   ```sql
@@ -46,8 +55,17 @@
       distance_m double precision  -- The distance apart in meters
   )
   ```
-
-4. Using the _shapes.txt_ file from GTFS bus feed, find the **two** routes with the longest trips. In the final query, give the `trip_headsign` that corresponds to the `shape_id` of this route and the length of the trip.
+|address|stop_name|distance_m|
+|:---:|:---:|:---:|
+|"170 SPRING LN"|"Ridge Av & Ivins Rd"|1658.7873935682778|
+|"150 SPRING LN"|"Ridge Av & Ivins Rd"|1620.287986054119|
+|"130 SPRING LN"|"Ridge Av & Ivins Rd"|1610.9941677070408|
+|"190 SPRING LN"|"Ridge Av & Ivins Rd"|1490.0758681771356|
+|"630 ST ANDREW RD"|"Germantown Av & Springfield Av"|1418.391081836291|
+|...|...|...|
+
+[4. Using the _shapes.txt_ file from GTFS bus feed, find the **two** routes with the longest trips.](query04.sql)
+In the final query, give the `trip_headsign` that corresponds to the `shape_id` of this route and the length of the trip.
 
   **Structure:**
   ```sql
@@ -56,15 +74,39 @@
       trip_length double precision  -- Length of the trip in meters
   )
   ```
+|trip_headsign|trip_length|
+|:---:|:---:|
+|"Bucks County Community College"|46504.13530588818|
+|NULL: no trip_headsign for 266697|45331.46753203432|
 
-5. Rate neighborhoods by their bus stop accessibility for wheelchairs. Use Azavea's neighborhood dataset from OpenDataPhilly along with an appropriate dataset from the Septa GTFS bus feed. Use the [GTFS documentation](https://gtfs.org/reference/static/) for help. Use some creativity in the metric you devise in rating neighborhoods. Describe your accessibility metric:
+[5. Rate neighborhoods by their bus stop accessibility for wheelchairs.](query05.sql)
+Use Azavea's neighborhood dataset from OpenDataPhilly along with an appropriate dataset from the Septa GTFS bus feed. Use the [GTFS documentation](https://gtfs.org/reference/static/) for help. Use some creativity in the metric you devise in rating neighborhoods. Describe your accessibility metric:
 
   **Description:**
-
-6. What are the _top five_ neighborhoods according to your accessibility metric?
-
-7. What are the _bottom five_ neighborhoods according to your accessibility metric?
-
+    The basic measure of accessibility is the equation A_i = ∑_j O_j * d_ij^(-b) (where X_y denotes that y is a subscript of X).
+    The accessibility of individual i, A_i, is the sum over all locations j of
+    the number of quality opportunities (such as jobs) at j, O_j,
+    multiplied by the separation between the individual's starting location and j, d_ij
+    (which can be measured in distance, time, or monetary cost), raised to the power -b, where b is the degree to which accessibility to that opportunity declines with increasing separation.
+
+    Parcels (potential dwellings) will be substituted for job opportunities -> (O_j)
+    A rule of thumb used by transportation planners is that people are generally willing to walk up to 0.5 miles to access transit.
+    Since we are measuring wheelchair accessibility, we will use a shorter distance and count the opportunities within 500 feet (about 152 meters) of each wheelchair-accessible bus stop -> (d_ij)
+
+    This index will be aggregated at the neighborhood level, and paired with a count of the wheelchair accessible stops in each neighborhood.
+
+[6. What are the _top five_ neighborhoods according to your accessibility metric?](query06.sql)
+[Screenshot of answer - queries take 45 minutes to run](A2_Q6_queryResults.PNG)
+|neighborhood_name|accessibility_metric|num_bus_stops_accessible|num_bus_stops_inaccessible|
+|:---:|:---:|:---:|:---:|
+|COBBS_CREEK|10282|123|10|
+|POINT_BREEZE|8943|83|0|
+|OLNEY|8960|172|0|
+|RICHMOND|8359|116|0|
+|WEST_OAK_LANE|7889|124|0|
+
+[7. What are the _bottom five_ neighborhoods according to your accessibility metric?](query07.sql)
+[Screenshot of answer - queries take 45 minutes to run](A2_Q7_results.PNG)
   **Both #6 and #7 should have the structure:**
   ```sql
   (
@@ -74,26 +116,44 @@
     num_bus_stops_inaccessible integer
   )
   ```
+|neighborhood_name|accessibility_metric|num_bus_stops_accessible|num_bus_stops_inaccessible|
+|:---:|:---:|:---:|:---:|
+|"WEST_PARK"|0|28|0|
+|"BARTRAM_VILLAGE"|0|0|14|
+|"PENNYPACK_PARK"|0|22|0|
+|"MECHANICSVILLE"|0|0|0|
+|"WEST_TORRESDALE"|2|1|0|
 
-8. With a query, find out how many census block groups Penn's main campus fully contains. Discuss which dataset you chose for defining Penn's campus.
+[8. With a query, find out how many census block groups Penn's main campus fully contains.](query08.sql)
+Discuss which dataset you chose for defining Penn's campus.
 
   **Structure (should be a single value):**
   ```sql
   (
       count_block_groups integer
   )
   ```
+|count_block_groups|
+|:---:|
+|1|
 
-9. With a query involving PWD parcels and census block groups, find the `geo_id` of the block group that contains Meyerson Hall. ST_MakePoint() and functions like that are not allowed.
+[9. With a query involving PWD parcels and census block groups, find the `geo_id` of the block group that contains Meyerson Hall.](query09.sql) 
+ ST_MakePoint() and functions like that are not allowed.
 
   **Structure (should be a single value):**
   ```sql
   (
       geo_id text
   )
   ```
+|geo_id|
+|:---:|
+|421010369001|
 
-10. You're tasked with giving more contextual information to rail stops to fill the `stop_desc` field in a GTFS feed. Using any of the data sets above, PostGIS functions (e.g., `ST_Distance`, `ST_Azimuth`, etc.), and PostgreSQL string functions, build a description (alias as `stop_desc`) for each stop. Feel free to supplement with other datasets (must provide link to data used so it's reproducible), and other methods of describing the relationships. PostgreSQL's `CASE` statements may be helpful for some operations.
+[10. You're tasked with giving more contextual information to rail stops to fill the `stop_desc` field in a GTFS feed.](query10.sql) 
+ Using any of the data sets above, PostGIS functions (e.g., `ST_Distance`, `ST_Azimuth`, etc.), and PostgreSQL string functions, build a description (alias as `stop_desc`) for each stop. Feel free to supplement with other datasets (must provide link to data used so it's reproducible), and other methods of describing the relationships. PostgreSQL's `CASE` statements may be helpful for some operations.
+ As an example, your `stop_desc` for a station stop may be something like "37 meters NE of 1234 Market St" (that's only an example, feel free to be creative, silly, descriptive, etc.)
+ **Tip when experimenting:** Use subqueries to limit your query to just a few rows to keep query times faster. Once your query is giving you answers you want, scale it up. E.g., instead of `FROM tablename`, use `FROM (SELECT * FROM tablename limit 10) as t`.
 
   **Structure:**
   ```sql
@@ -105,7 +165,14 @@
       stop_lat double precision
   )
   ```
-
-  As an example, your `stop_desc` for a station stop may be something like "37 meters NE of 1234 Market St" (that's only an example, feel free to be creative, silly, descriptive, etc.)
-
-  **Tip when experimenting:** Use subqueries to limit your query to just a few rows to keep query times faster. Once your query is giving you answers you want, scale it up. E.g., instead of `FROM tablename`, use `FROM (SELECT * FROM tablename limit 10) as t`.
+I decided to list the bus stop closest to each rail station and the bus route that serves that stop. This method is flawed in how it
+handles multiple bus routes that serve the same bus stop.
+
+|stop_id|stop_name|stop_desc|stop_lon|stop_lat|
+|:---:|:---:|:---:|:---:|:---:|
+|91004|"30th St Lower Level"|"The closest bus stop is 33rd St & Race St and is 84.56 meters away. It is serviced by the City Hall to 76th-City route."|-75.1883333|39.9591667|
+|90004|"30th Street Station"|"The closest bus stop is  and is 168.65 meters away. It is serviced by the  route."|-75.1816667|39.9566667|
+|90314|"49th Street"|"The closest bus stop is 49th St & Chester Av - FS and is 46.76 meters away. It is serviced by the 50th-Parkside to Pier 70 route."|-75.2166667|39.9436111|
+|90539|"9TH Street Lansdale"|"The closest bus stop is Broad St & Hatfield St - FS and is 259.03 meters away. It is serviced by the Telford to Montgomery Mall route."|-75.2791667|40.25|
+|90404|"Airport Terminal A"|"The closest bus stop is  and is 142.09 meters away. It is serviced by the  route."|-75.2452778|39.8761111|
+|...|...|...|...|...|
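The accessibility equation described under question 5 can be checked by hand on a few made-up numbers. The sketch below is illustrative only: the `opportunities` rows, the distances, and the decay exponent b = 1 are assumptions, not values taken from the assignment data or from query05.sql.

```sql
-- Worked example of A_i = sum_j O_j * d_ij^(-b) on invented numbers.
-- Every value here is hypothetical; b = 1 is an assumed decay exponent.
with opportunities (o_j, d_ij) as (
    values
        (10, 100.0),  -- 10 opportunities 100 m away
        (5,  200.0),  -- 5 opportunities 200 m away
        (20, 400.0)   -- 20 opportunities 400 m away
)
select sum(o_j * power(d_ij, -1.0)) as a_i
from opportunities;
-- By hand: 10/100 + 5/200 + 20/400 = 0.175
```

As the description reads, the README's metric appears to replace the smooth decay term with a hard cutoff: parcels within 500 feet of a wheelchair-accessible stop are counted and parcels beyond that distance are not.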
diff --git a/createTbls.sql b/createTbls.sql
@@ -0,0 +1,165 @@
+-- CREATE EXTENSION postgis; --
+
+-- PHL Bus stops --
+DROP TABLE IF EXISTS septa_bus_stops;
+CREATE TABLE septa_bus_stops (
+    stop_id				NUMERIC(7) PRIMARY KEY NOT NULL,
+    stop_name			VARCHAR(65) NOT NULL, 
+	stop_lat			FLOAT NOT NULL,
+    stop_lon			FLOAT NOT NULL,
+    location_type		NUMERIC(3),
+	parent_station		NUMERIC(7),
+	zone_id				NUMERIC(3),
+	wheelchair_boarding	NUMERIC(3)
+);
+
+-- Import data into bus stop table --
+COPY septa_bus_stops(stop_id, stop_name, stop_lat, stop_lon, location_type, parent_station, zone_id, wheelchair_boarding) 
+FROM 'C:\Users\Public\CloudComputing_data\google_bus\stops.csv' 
+DELIMITER ',' 
+CSV HEADER;
+
+-- Add geometry field to bus stop data --
+ALTER TABLE septa_bus_stops ADD COLUMN the_geom geometry(Point, 4326);
+UPDATE septa_bus_stops SET the_geom = ST_SetSRID(ST_MakePoint(stop_lon, stop_lat),4326);
+
+
+-- PHL Bus Stop times --
+DROP TABLE IF EXISTS septa_bus_stopTimes;
+CREATE TABLE septa_bus_stopTimes (
+	trip_id 			NUMERIC,
+	-- Use VARCHAR for times: GTFS allows values past 24:00:00 for trips that run past midnight, which the TIME type rejects --
+	arrival_time		VARCHAR,
+	departure_time		VARCHAR,
+	stop_id				NUMERIC,
+	stop_sequence		NUMERIC
+);
+
+-- Import data into bus stop times table --
+COPY septa_bus_stopTimes 
+FROM 'C:\Users\Public\CloudComputing_data\google_bus\stop_times.txt' 
+DELIMITER ','
+CSV HEADER;
+
+
+-- PHL Bus Trips --
+DROP TABLE IF EXISTS septa_bus_trips;
+CREATE TABLE septa_bus_trips (
+    route_id			NUMERIC,
+    service_id			NUMERIC,
+	trip_id				NUMERIC PRIMARY KEY NOT NULL,   
+	trip_headsign		VARCHAR(65) NOT NULL, 
+	block_id			NUMERIC,
+	direction_id		NUMERIC,
+	shape_id			NUMERIC
+);
+
+-- Import data into bus trips table --
+COPY septa_bus_trips 
+FROM 'C:\Users\Public\CloudComputing_data\google_bus\trips.csv' 
+DELIMITER ',' 
+CSV HEADER;
+
+
+-- PHL Bus Routes --
+DROP TABLE IF EXISTS septa_bus_routes;
+CREATE TABLE septa_bus_routes (
+    route_id			VARCHAR,
+    route_short_name	VARCHAR, 
+	route_long_name		VARCHAR, 
+	route_type			NUMERIC,
+	route_color			VARCHAR,
+	route_text_color	VARCHAR,
+	route_url			VARCHAR
+);
+
+-- Import data into bus routes table --
+COPY septa_bus_routes 
+FROM 'C:\Users\Public\CloudComputing_data\google_bus\routes.csv' 
+DELIMITER ',' 
+CSV HEADER;
+
+
+
+-- Septa Bus shapes --
+DROP TABLE IF EXISTS septa_bus_shapes;
+CREATE TABLE septa_bus_shapes (
+    shape_id			NUMERIC(7) NOT NULL,
+	shape_pt_lat		FLOAT NOT NULL,
+    shape_pt_lon		FLOAT NOT NULL,
+    shape_pt_sequence	NUMERIC(5)
+);
+
+COPY septa_bus_shapes(shape_id, shape_pt_lat, shape_pt_lon,shape_pt_sequence) 
+FROM 'C:\Users\Public\CloudComputing_data\google_bus\shapes.csv' 
+DELIMITER ',' 
+CSV HEADER;
+
+-- Add geometry field to bus shapes data --
+ALTER TABLE septa_bus_shapes ADD COLUMN the_geom geometry(Point, 4326);
+UPDATE septa_bus_shapes SET the_geom = ST_SetSRID(ST_MakePoint(shape_pt_lon, shape_pt_lat),4326);
+
+
+
+-- SEPTA rail stops --
+DROP TABLE IF EXISTS septa_rail_stops;
+CREATE TABLE septa_rail_stops(
+    stop_id 	numeric,
+    stop_name 	text,
+    stop_desc 	text,
+    stop_lat 	numeric,
+    stop_lon 	numeric,
+    zone_id 	text,
+    stop_url 	text
+);
+
+COPY septa_rail_stops 
+FROM 'C:\Users\Public\CloudComputing_data\google_rail\stops.csv' 
+DELIMITER ',' 
+CSV HEADER;
+
+-- Add geometry field to rail stops data --
+ALTER TABLE septa_rail_stops ADD COLUMN the_geom geometry(Point, 4326);
+UPDATE septa_rail_stops SET the_geom = ST_SetSRID(ST_MakePoint(stop_lon, stop_lat),4326);
+
+
+
+-- PHL Census Block Group Population join w/ census_block_groups_2010 --
+DROP TABLE IF EXISTS population;
+CREATE TABLE population (
+    id		VARCHAR(23) PRIMARY KEY NOT NULL,
+    name	VARCHAR(75) NOT NULL, 
+	total	NUMERIC(7) NOT NULL
+);
+
+COPY population(id, name, total) 
+FROM 'C:\Users\Public\CloudComputing_data\PHL_2010_blockGroupPopulation\phl_2010_blockGroup_population.csv' 
+DELIMITER ',' 
+CSV HEADER;
+
+
+-- Edit block group shp geom column name & set its crs --
+-- ALTER TABLE census_block_groups RENAME COLUMN geom TO the_geom;
+--UPDATE census_block_groups SET the_geom = ST_Transform(ST_SetSRID(the_geom, 4326),32129);
+
+-- Edit parcels shp geom column name & set its crs --
+-- ALTER TABLE pwd_parcels RENAME COLUMN geom TO the_geom;
+-- UPDATE pwd_parcels SET the_geom = ST_Transform(ST_SetSRID(the_geom, 4326),32129);
+
+
+--ALTER TABLE neighborhoods_philadelphia RENAME COLUMN geom to the_geom;
+--UPDATE neighborhoods_philadelphia SET the_geom = ST_Transform(ST_SetSRID(the_geom, 2272),32129);
+
+
+
+
+-- Create spatial indices --
+-- DROP INDEX IF EXISTS septa_bus_stops_the_geom_idx;
+-- create index septa_bus_stops_the_geom_idx
+--     on septa_bus_stops
+--     using GiST(st_transform(the_geom, 32129));
+
+-- DROP INDEX IF EXISTS pwd_parcels_the_geom_idx;
+-- CREATE index pwd_parcels_the_geom_idx
+-- 	on pwd_parcels
+-- 	using GiST(the_geom);
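The stop times table in createTbls.sql stores `arrival_time` and `departure_time` as VARCHAR because some values exceed 24:00:00. One possible way to work with those values, offered as a sketch and not as what the assignment queries do, is to cast them to `interval`, which PostgreSQL accepts for hour counts above 24. The literal '25:10:00' below is a made-up sample value.

```sql
-- Hypothetical sketch: treat GTFS clock strings as offsets from the
-- start of the service day instead of as times of day.
select
    '25:10:00'::interval as arrival_offset,
    extract(epoch from '25:10:00'::interval) as seconds_into_service_day;
-- By hand: 25 * 3600 + 10 * 60 = 90600 seconds

-- The same cast applied to the imported table (assumes the COPY above succeeded).
select
    trip_id,
    stop_sequence,
    arrival_time::interval as arrival_offset
from septa_bus_stopTimes
limit 5;
```

Keeping the column as VARCHAR and casting at query time leaves the import unchanged. Declaring the column as `interval` in the CREATE TABLE would be an alternative, though that has not been tested against this feed.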
diff --git a/query01.sql b/query01.sql
@@ -3,17 +3,15 @@
   estimation, consider any block group that intersects the buffer as being part
   of the 800 meter buffer.
 */
-
-
-create index septa_bus_stops__the_geom__32129__idx
-    on septa_bus_stops
-    using GiST (ST_Transform(the_geom, 32129));
+-- Answer: 
+-- stop_name               Population      the_geom                                      --
+-- "Passyunk Av & 15th St"	50867	"0101000020E6100000B1C398F4F7CA52C0D0807A336AF64340" --
 
 
 with septa_bus_stop_block_groups as (
     select
         s.stop_id,
-        '1500000US' || bg.geoid10 as geo_id
+        '1500000US' || bg.geoid10 as geo_id -- prepend the Census summary-level prefix (150 = block group) so geoid10 matches population.id
     from septa_bus_stops as s
     join census_block_groups as bg
         on ST_DWithin(
@@ -22,21 +20,19 @@ with septa_bus_stop_block_groups as (
             800
         )
 ),
-
 septa_bus_stop_surrounding_population as (
     select
         stop_id,
-        sum(population) as estimated_pop_800m
+        sum(total) as estimated_pop_800m
     from septa_bus_stop_block_groups as s
-    join census_population as p using (geo_id)
+    join population as p on s.geo_id = p.id
     group by stop_id
 )
-
 select
     stop_name,
     estimated_pop_800m,
     the_geom
 from septa_bus_stop_surrounding_population
 join septa_bus_stops using (stop_id)
 order by estimated_pop_800m desc
-limit 1
+limit 1
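Because query01.sql joins block groups to `population` with an inner join on the prefixed GEOID, any block group whose prefixed `geoid10` has no match in `population.id` silently drops out of the sum. A minimal sanity check, assuming the `census_block_groups` and `population` tables are loaded as in createTbls.sql, could look like this:

```sql
-- Sketch of a join check: count block groups with no population row.
-- A result of 0 suggests the '1500000US' prefix lines up with population.id.
select count(*) as unmatched_block_groups
from census_block_groups as bg
left join population as p
    on p.id = '1500000US' || bg.geoid10
where p.id is null;
```

A nonzero count would not necessarily mean the query is wrong, only that some block groups contribute nothing to `estimated_pop_800m`.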