MCQs on Data Catalog and Schema Management | AWS Glue MCQ Questions

AWS Glue simplifies the process of preparing and analyzing data. Chapter 3 focuses on the AWS Glue Data Catalog, schema discovery with crawlers, and managing metadata and partitions. These AWS Glue MCQ questions and answers will help you deepen your knowledge and prepare for real-world data engineering tasks and AWS certifications.


Multiple-Choice Questions (MCQs)

AWS Glue Data Catalog

  1. What is the primary purpose of the AWS Glue Data Catalog?
    a) To store raw data files
    b) To provide metadata for datasets
    c) To visualize data analytics
    d) To run machine learning models
  2. AWS Glue Data Catalog can be integrated with which of the following services?
    a) Amazon S3
    b) Amazon Redshift
    c) Amazon Athena
    d) All of the above
  3. In the AWS Glue Data Catalog, a database is:
    a) A collection of tables and metadata
    b) A storage location for raw data
    c) A compute engine for processing data
    d) A schema for organizing columns
  4. What is required to query a dataset using Amazon Athena?
    a) An AWS Glue table registered in the Data Catalog
    b) An S3 bucket with raw data
    c) A running EMR cluster
    d) An IAM role with S3 access
  5. How does the AWS Glue Data Catalog manage schema versions?
    a) Automatically creates a new version for each schema change
    b) Overwrites the schema without versioning
    c) Stores only the latest schema version
    d) Requires manual schema updates

Schema Discovery and Crawlers

  1. What is the purpose of an AWS Glue crawler?
    a) To transform data into a new format
    b) To discover metadata and create table definitions
    c) To move data between databases
    d) To monitor data pipeline performance
  2. Which input source is NOT supported by AWS Glue crawlers?
    a) Amazon S3
    b) Amazon RDS
    c) Local file systems
    d) DynamoDB
  3. How does a crawler identify partitions in an S3 bucket?
    a) By analyzing file content
    b) By examining folder structure and naming conventions
    c) By querying the S3 API
    d) By using AWS CloudTrail logs
  4. What happens when a crawler detects schema inconsistencies?
    a) It stops and raises an error
    b) It creates a new table
    c) It updates the table schema with the latest structure
    d) It ignores the inconsistencies
  5. What can you do to speed up a crawler’s runtime?
    a) Use larger instance sizes
    b) Reduce the number of data sources
    c) Provide specific include and exclude patterns
    d) Increase the crawler timeout

Managing Metadata and Partitions

  1. What is metadata in the context of AWS Glue?
    a) The actual data stored in S3
    b) Information describing datasets, such as schema and location
    c) The compute resources for data processing
    d) A data encryption mechanism
  2. In AWS Glue, what is a partition?
    a) A way to divide datasets for parallel processing
    b) A storage container for metadata
    c) A schema definition in the Data Catalog
    d) A data format supported by Glue
  3. Why are partitions useful in AWS Glue?
    a) They improve query performance by filtering data
    b) They simplify schema updates
    c) They allow storing data in multiple formats
    d) They reduce storage costs
  4. How can partitions be added to an existing table in the Data Catalog?
    a) Manually add entries in the AWS Management Console
    b) Run a crawler on the data source
    c) Use a Glue ETL job to update partitions
    d) All of the above
  5. Which tool can you use to edit table metadata in the Data Catalog?
    a) AWS Management Console
    b) AWS Command Line Interface (CLI)
    c) AWS SDK
    d) All of the above

Scenario-Based Questions

  1. A dataset is updated frequently, with new data added to different folders in S3. What’s the best way to keep the Data Catalog updated?
    a) Run a Glue crawler periodically
    b) Use Lambda to update the catalog
    c) Manually edit the table definitions
    d) Use DynamoDB streams
  2. When a table schema changes frequently, how can you ensure the Data Catalog remains consistent?
    a) Configure the crawler to update schemas automatically
    b) Manually adjust schema definitions
    c) Stop schema versioning
    d) Use a fixed schema
  3. A crawler detects new partitions in an S3 bucket but does not update the table schema. What could be the issue?
    a) The IAM role lacks proper permissions
    b) The crawler is misconfigured
    c) The table is locked by another process
    d) The dataset contains unsupported file types
  4. A Glue ETL job writes partitioned data to S3. What must be done to make these partitions queryable?
    a) Register them in the Data Catalog
    b) Compress the partitions
    c) Create a new Glue database
    d) Set up a DynamoDB table
  5. When should you manually create table metadata in the Data Catalog instead of using a crawler?
    a) When the data format is unsupported by crawlers
    b) When schema changes are infrequent
    c) When the dataset is small
    d) Always use a crawler
  6. If you need to track schema changes over time, which feature should you use?
    a) Schema versioning
    b) Data partitioning
    c) ETL job logs
    d) Table locking
  7. How does AWS Glue handle hierarchical data like JSON?
    a) It flattens the data into relational tables
    b) It discards nested fields
    c) It supports nested schemas without modification
    d) It converts it into binary formats like Avro
  8. A dataset stored in S3 has multiple file formats. How can you create a unified schema in Glue?
    a) Use a crawler to infer the schema
    b) Convert all files to a single format
    c) Use Glue ETL jobs to harmonize the data
    d) Query each file format separately
  9. What happens when you delete a table from the Data Catalog?
    a) The underlying data in S3 is deleted
    b) Only the metadata is deleted
    c) All partitions are deleted but not the schema
    d) Nothing happens; it requires manual confirmation
  10. What is a common use case for Glue DynamicFrames?
    a) Handling semi-structured data
    b) Creating schemas in the Data Catalog
    c) Storing unstructured files in S3
    d) Querying data using SQL

Answers

Answers are numbered 1 to 25, following the order of the questions across the four sections above.

QNo  Answer
1    b) To provide metadata for datasets
2    d) All of the above
3    a) A collection of tables and metadata
4    a) An AWS Glue table registered in the Data Catalog
5    a) Automatically creates a new version for each schema change
6    b) To discover metadata and create table definitions
7    c) Local file systems
8    b) By examining folder structure and naming conventions
9    c) It updates the table schema with the latest structure
10   c) Provide specific include and exclude patterns
11   b) Information describing datasets, such as schema and location
12   a) A way to divide datasets for parallel processing
13   a) They improve query performance by filtering data
14   d) All of the above
15   d) All of the above
16   a) Run a Glue crawler periodically
17   a) Configure the crawler to update schemas automatically
18   a) The IAM role lacks proper permissions
19   a) Register them in the Data Catalog
20   a) When the data format is unsupported by crawlers
21   a) Schema versioning
22   c) It supports nested schemas without modification
23   a) Use a crawler to infer the schema
24   b) Only the metadata is deleted
25   a) Handling semi-structured data
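
Answers 8 and 19 rest on the same idea: partition values live in the S3 folder names, and they become queryable only once matching entries exist in the Data Catalog. The sketch below is a rough illustration in plain Python, not the crawler's actual algorithm. It assumes Hive-style `key=value` folders, and the bucket, table root, and helper names (`parse_hive_partitions`, `build_partition_input`) are hypothetical. The returned dict is shaped like the `PartitionInput` structure accepted by the Glue `BatchCreatePartition` API, but it is deliberately incomplete: a real request would usually also carry the table's columns, formats, and SerDe settings in `StorageDescriptor`.

```python
def parse_hive_partitions(s3_uri, table_root):
    """Return ordered (key, value) pairs from key=value folders under table_root."""
    root = table_root.rstrip("/") + "/"
    if not s3_uri.startswith(root):
        raise ValueError(f"{s3_uri} is not under {root}")
    # Drop the file name; keep only the folder segments below the table root.
    folders = s3_uri[len(root):].split("/")[:-1]
    pairs = []
    for folder in folders:
        key, sep, value = folder.partition("=")
        if not sep:
            # Not a key=value folder; this sketch simply skips it.
            continue
        pairs.append((key, value))
    return pairs


def build_partition_input(s3_uri, table_root):
    """Build a minimal dict shaped like a Glue PartitionInput (assumption: Location only)."""
    pairs = parse_hive_partitions(s3_uri, table_root)
    # The partition location is the folder that contains the object.
    location = s3_uri.rsplit("/", 1)[0] + "/"
    return {
        "Values": [value for _, value in pairs],
        "StorageDescriptor": {"Location": location},
    }


# Hypothetical object written by an ETL job.
uri = "s3://example-bucket/sales/year=2024/month=01/part-0000.parquet"
root = "s3://example-bucket/sales"
print(parse_hive_partitions(uri, root))
print(build_partition_input(uri, root))
```

With boto3, a list of such dicts could be passed as `PartitionInputList` to `batch_create_partition`, alongside `DatabaseName` and `TableName`. Running a crawler over the same path is the lower-effort alternative when the folder layout is regular.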

Note your answers on a blank sheet, then tally them against the answer key above and give yourself a score.
