In this homework we'll prepare the environment and practice with Docker and SQL
Run the command to get information on Docker
docker --help
Now run the command to get help on the "docker build" command
Which tag has the following text? - Write the image ID to the file
--imageid string
--iidfile string
--idimage string
--idfile string
Run docker with the python:3.9 image in an interactive mode and the entrypoint of bash. Now check the python modules that are installed ( use pip list). How many python packages/modules are installed? docker run -it --entrypoint=bash python:3.9
- 1
- 6
- 3
- 7
Run Postgres and load data as shown in the videos We'll use the green taxi trips from January 2019:
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz
You will also need the dataset with zones:
wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
Download this data and put it into Postgres (with jupyter notebooks or with a pipeline)
How many taxi trips were totally made on January 15?
Tip: started and finished on 2019-01-15.
Remember that lpep_pickup_datetime
and lpep_dropoff_datetime
columns are in the format timestamp (date and hour+min+sec) and not in date.
- 20689
- 20530
- 17630
- 21090
Which was the day with the largest trip distance Use the pick up time for your calculations.
- 2019-01-18
- 2019-01-28
- 2019-01-15
- 2019-01-10
In 2019-01-01 how many trips had 2 and 3 passengers?
- 2: 1282 ; 3: 266
- 2: 1532 ; 3: 126
- 2: 1282 ; 3: 254
- 2: 1282 ; 3: 274
For the passengers picked up in the Astoria Zone which was the drop off zone that had the largest tip? We want the name of the zone, not the id.
Note: it's not a typo, it's tip
, not trip
- Central Park
- Jamaica
- South Ozone Park
- Long Island City/Queens Plaza
- Form for submitting: form
- You can submit your homework multiple times. In this case, only the last submission will be used.
Deadline: 30 January (Monday), 22:00 CET
We will publish the solution here
docker image build --help Build an image from a Dockerfile Options: --add-host list Add a custom host-to-IP mapping (host:ip) --build-arg list Set build-time variables --cache-from strings Images to consider as cache sources --cgroup-parent string Optional parent cgroup for the container --compress Compress the build context using gzip --cpu-period int Limit the CPU CFS (Completely Fair Scheduler) period --cpu-quota int Limit the CPU CFS (Completely Fair Scheduler) quota -c, --cpu-shares int CPU shares (relative weight) --cpuset-cpus string CPUs in which to allow execution (0-3, 0,1) --cpuset-mems string MEMs in which to allow execution (0-3, 0,1) --disable-content-trust Skip image verification (default true) -f, --file string Name of the Dockerfile (Default is 'PATH/Dockerfile') --force-rm Always remove intermediate containers --iidfile string Write the image ID to the file --isolation string Container isolation technology --label list Set metadata for an image -m, --memory bytes Memory limit --memory-swap bytes Swap limit equal to memory plus swap: '-1' to enable unlimited swap --network string Set the networking mode for the RUN instructions during build (default "default") --no-cache Do not use cache when building the image --pull Always attempt to pull a newer version of the image -q, --quiet Suppress the build output and print image ID on success --rm Remove intermediate containers after a successful build (default true) --security-opt strings Security options --shm-size bytes Size of /dev/shm -t, --tag list Name and optionally a tag in the 'name:tag' format --target string Set the target build stage to build. --ulimit ulimit Ulimit options (default [])
docker run -it --entrypoint=bash python:3.9 pip list
SELECT CAST(lpep_pickup_datetime AS DATE) AS day_pick, COUNT(1) FROM green_taxi_data g WHERE CAST(lpep_pickup_datetime AS DATE) = '2019-01-15' GROUP BY day_pick ORDER BY day_pick ASC;
SELECT CAST(lpep_pickup_datetime AS DATE) AS day_pick, MAX(trip_distance) FROM green_taxi_data g GROUP BY day_pick ORDER BY max DESC;
SELECT CAST(lpep_pickup_datetime AS DATE) AS day_pick, COUNT(trip_distance) FROM green_taxi_data g WHERE CAST(lpep_pickup_datetime AS DATE) = '2019-01-01' AND passenger_count = 2 GROUP BY day_pick
SELECT CAST(lpep_pickup_datetime AS DATE) AS day_pick, COUNT(trip_distance) FROM green_taxi_data g WHERE CAST(lpep_pickup_datetime AS DATE) = '2019-01-01' AND passenger_count = 3 GROUP BY day_pick
SELECT tip_amount, zpu."Zone" AS pick_up, zdo."Zone" AS drop_off FROM green_taxi_data g JOIN zones zpu ON g."PULocationID" = zpu."LocationID" JOIN zones zdo ON g."DOLocationID" = zdo."LocationID" WHERE zpu."Zone" LIKE 'Astoria' ORDER BY tip_amount DESC