In this step, we'll use AWS Glue to build a data catalog, with a crawler configured to run every hour so that all data ingested into our data lake is automatically cataloged. A data catalog consists of one or more Glue databases, each containing one or more metadata tables for your datasets. In this demo, we'll create one Glue database containing one table for our dataset, which consists of a single CSV file. There are two ways to create a Glue table: manually, by defining a custom schema, or automatically, by using a Glue crawler that crawls the dataset to determine the schema. For this session, we are going to automate the creation of the data catalog by using a crawler.
A crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your data catalog. A classifier determines the schema of your data. You can use the AWS Glue built-in classifiers or write your own. Today, we will use the built-in classifiers.
Click the blue "Add crawler" button.
Specify a name. In the screenshot, I have used s6r8n6 as shorthand for the rather verbose name I had planned to use earlier: sahays-reinvent-nyctlc. Check "Create a single schema for each S3 path" to tell the crawler to group compatible schemas into a single table definition.
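If you prefer scripting over the console wizard, the same grouping behavior is controlled through the crawler's Configuration property. A minimal sketch using the AWS SDK for Python (boto3); the JSON below is what I understand the console checkbox maps to:

import json

# Crawler configuration equivalent to checking
# "Create a single schema for each S3 path" in the console wizard
crawler_configuration = json.dumps({
    "Version": 1.0,
    "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
})

We'll pass this string to the crawler definition later on.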
Use the S3 bucket where you copied the public dataset in Step 1.3 above. Tip: you can use the folder icon to browse to the bucket/folder.
On the next screen, accept the default.
Pick an IAM role (existing or new). The wizard attaches the AWSGlueServiceRole managed policy to the role, along with an inline policy that grants access to the S3 data store bucket/folder you specified in the previous step. For this session, I have chosen to create a new IAM role for the crawler.
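If you are scripting the setup instead of letting the wizard create the role, the equivalent boto3 calls look roughly like the sketch below. The role name, bucket, and prefix are placeholders, not values from this walkthrough:

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Glue service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AWSGlueServiceRole-nyctlc",  # placeholder role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Managed policy the wizard attaches for you
iam.attach_role_policy(
    RoleName="AWSGlueServiceRole-nyctlc",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

# Inline policy granting read access to the data store bucket/prefix
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-datalake-bucket",            # placeholder bucket
            "arn:aws:s3:::my-datalake-bucket/nyc_tlc/*",  # placeholder prefix
        ],
    }],
}

iam.put_role_policy(
    RoleName="AWSGlueServiceRole-nyctlc",
    PolicyName="glue-s3-datastore-access",
    PolicyDocument=json.dumps(s3_policy),
)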
You can choose "Run on demand", pick a named schedule (e.g., hourly), or supply a cron expression for a custom schedule. For this session, I have chosen "Hourly".
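For reference, Glue schedules use a six-field cron expression evaluated in UTC. A sketch of a value equivalent to the "Hourly" choice, assuming runs at the top of each hour:

# Glue cron format: minutes hours day-of-month month day-of-week year (UTC)
hourly_schedule = "cron(0 * * * ? *)"  # run at minute 0 of every hour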
The next step is to configure the crawler's output to a Glue database. We have the option of using an existing database or adding a new one.
We'll add a new database by clicking the "Add database" button.
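The same step in boto3 is a single call. Here the database name matches the one referenced in the Athena query later, but treat it as an example value:

import boto3

glue = boto3.client("glue")

# Create the Glue database that will hold the crawler's output tables
glue.create_database(DatabaseInput={"Name": "s6r8n6"})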
Finally, review the summary and then click Finish to create the crawler.
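The whole wizard boils down to a single create_crawler call. A sketch, assuming the role, database, schedule, and grouping configuration from the earlier steps; the bucket/prefix and role name are placeholders:

import json
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="s6r8n6",
    Role="AWSGlueServiceRole-nyctlc",      # role from the previous step (placeholder)
    DatabaseName="s6r8n6",                 # Glue database created above
    Description="Crawls the NYC TLC csv data store",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/nyc_tlc/"}]},  # placeholder path
    Schedule="cron(0 * * * ? *)",          # hourly
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
)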
Click on the "Crawlers" menu in the left side panel to view a list of all the crawlers. Select the crawler that we created and then click the "Run crawler" button. The crawler will now run; after about two minutes it will finish, producing a new Glue database and a metadata table whose schema the crawler inferred from the data store.
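You can also start the crawler and wait for it to finish programmatically. A sketch, assuming the crawler name used above:

import time
import boto3

glue = boto3.client("glue")

glue.start_crawler(Name="s6r8n6")

# Poll until the crawler has finished and returned to the READY state
while True:
    state = glue.get_crawler(Name="s6r8n6")["Crawler"]["State"]
    if state == "READY":
        break
    time.sleep(30)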
A table is the metadata definition that represents your data, including its schema. Click on the "Tables" menu in the left side panel to view a list of tables.
Then, select the table to view its details,
and finally, review the schema generated by the crawler.
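The same schema can be read back with get_table. A sketch, assuming the database and table names used in the query below:

import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="s6r8n6", Name="nyc_tlc")["Table"]

# Columns inferred by the crawler (name and type per column)
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])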
Select your table from the list of tables and then click "View Data". This launches Amazon Athena, where we can run queries directly against the S3 data store using the schema generated by the crawler. Execute the following SQL and take note of the run time and data scanned; in the next step, we'll transform the data into a columnar, compressed format and see how those numbers improve.
SELECT count(vendorid) AS vendors, vendorid
FROM "s6r8n6"."nyc_tlc"
GROUP BY vendorid
ORDER BY vendors DESC;
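If you want to capture those numbers programmatically instead of reading them off the console, the Athena API exposes them in the query statistics. A sketch, assuming the database and table above and a placeholder output location for query results:

import time
import boto3

athena = boto3.client("athena")

query = """
SELECT count(vendorid) AS vendors, vendorid
FROM "s6r8n6"."nyc_tlc"
GROUP BY vendorid
ORDER BY vendors DESC
"""

execution_id = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "s6r8n6"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # placeholder
)["QueryExecutionId"]

# Wait for the query to finish, then read run time and data scanned
while True:
    execution = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]
    if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

stats = execution["Statistics"]
print("Run time (ms):", stats["EngineExecutionTimeInMillis"])
print("Data scanned (bytes):", stats["DataScannedInBytes"])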