MCQs on Advanced ETL Techniques | AWS Glue MCQ Questions

Test your knowledge with these AWS Glue MCQ questions and answers for Chapter 5: Advanced ETL Techniques. Covering essential topics such as Dynamic Frame Transformations, Handling Semi-Structured and Streaming Data, and Performance Optimization, these questions are designed to strengthen your understanding of AWS Glue and help you master ETL workflows and real-world applications.


Chapter 5: Advanced ETL Techniques – AWS Glue MCQs

Topic 1: Dynamic Frame Transformations

  1. What is a DynamicFrame in AWS Glue?
    a) A Spark DataFrame wrapper with additional capabilities
    b) A database table
    c) A JSON file representation
    d) A NoSQL database
  2. Which method is used to convert a DynamicFrame to a DataFrame?
    a) to_pandas()
    b) to_dataframe()
    c) toDF()
    d) convertFrame()
  3. What is the key advantage of using DynamicFrames in AWS Glue?
    a) They handle schema changes gracefully
    b) Faster execution than DataFrames
    c) Built-in data encryption
    d) Automatic cost optimization
  4. Which transformation operation is used to filter data in a DynamicFrame?
    a) filter()
    b) select_fields()
    c) rename_field()
    d) glue_filter()
  5. How do you map a field in a DynamicFrame?
    a) Using apply_mapping()
    b) Using map_field()
    c) Using rename_field()
    d) Using update_mapping()
  6. What does the relationalize() function do in AWS Glue?
    a) Converts semi-structured data to normalized tables
    b) Converts a DataFrame to a CSV file
    c) Creates dynamic columns
    d) Deletes unstructured data
  7. How are null values handled in DynamicFrames?
    a) Automatically dropped
    b) Replaced with default values
    c) Identified but not modified
    d) Replaced with NaN
  8. Which language is primarily used for writing AWS Glue ETL scripts?
    a) SQL
    b) Python
    c) Java
    d) Scala
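The transformations quizzed above — `filter()` and `apply_mapping()` — normally run on a DynamicFrame inside a Glue/Spark environment, where the `awsglue` library is available. As a conceptual illustration only, the pure-Python sketch below mimics both operations on plain dict records; the helper names `filter_records` and `apply_mapping` are hypothetical stand-ins, not Glue APIs.

```python
# Pure-Python sketch of two DynamicFrame-style transformations.
# In a real Glue job these would be DynamicFrame.filter() and
# DynamicFrame.apply_mapping(); here plain dict records stand in
# for a DynamicFrame so the idea runs anywhere.

records = [
    {"id": 1, "name": "alice", "age": 34},
    {"id": 2, "name": "bob", "age": 17},
    {"id": 3, "name": "carol", "age": 52},
]

def filter_records(rows, predicate):
    """Analogue of DynamicFrame.filter(f=...): keep rows where predicate holds."""
    return [r for r in rows if predicate(r)]

def apply_mapping(rows, mapping):
    """Analogue of DynamicFrame.apply_mapping(): rename and cast fields.

    mapping is a list of (source_field, target_field, cast) triples.
    """
    out = []
    for r in rows:
        new_row = {}
        for src, dst, cast in mapping:
            if src in r:
                new_row[dst] = cast(r[src])
        out.append(new_row)
    return out

adults = filter_records(records, lambda r: r["age"] >= 18)
mapped = apply_mapping(adults, [("id", "user_id", str), ("name", "user_name", str)])
print(mapped)  # [{'user_id': '1', 'user_name': 'alice'}, {'user_id': '3', 'user_name': 'carol'}]
```

In an actual Glue script the equivalent call would look like `dyf.apply_mapping([("id", "long", "user_id", "string"), ...])`, with Glue handling the type casts from the mapping's source and target types.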

Topic 2: Handling Semi-Structured and Streaming Data

  1. Which format is commonly used for semi-structured data in AWS Glue?
    a) CSV
    b) JSON
    c) TXT
    d) XML
  2. How does AWS Glue handle nested fields in semi-structured data?
    a) By flattening them automatically
    b) Using relationalize()
    c) Storing them as-is
    d) By converting to CSV
  3. What is AWS Glue’s default classification for JSON files?
    a) relational_json
    b) spark_json
    c) simple_json
    d) semi_structured
  4. Can AWS Glue process streaming data directly?
    a) Yes, through Spark Streaming
    b) No, it only handles batch data
    c) Only if integrated with Amazon Kinesis
    d) Only if stored in S3
  5. What AWS service is commonly used to stream data to AWS Glue?
    a) Amazon S3
    b) Amazon Kinesis
    c) Amazon DynamoDB
    d) AWS Lambda
  6. Which classification does Glue use for semi-structured log data?
    a) XML_log
    b) JSON_log
    c) cloudwatch_log
    d) apache_log
  7. How can AWS Glue infer the schema of semi-structured data?
    a) Through crawlers
    b) Using a predefined schema
    c) By executing SQL queries
    d) By manual input
  8. What feature in AWS Glue helps handle schema evolution in streaming data?
    a) DynamicFrame
    b) DataFrame
    c) Schema Registry
    d) Glue Catalog
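Several questions above touch on nested, semi-structured JSON and the `relationalize()` transform. As a simplified illustration, the sketch below flattens nested dicts into dotted column names; this is only the struct-flattening part of the idea — real `relationalize()` also splits arrays out into separate tables, which this toy version does not attempt.

```python
# Conceptual sketch of flattening nested, semi-structured JSON,
# similar in spirit to what DynamicFrame.relationalize() does for
# nested structs. Arrays become separate tables in Glue; this
# simplified version handles nested dicts only.
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted column names, e.g. 'address.city'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # recurse into nested structs
        else:
            flat[name] = value
    return flat

raw = json.loads('{"id": 7, "address": {"city": "Seattle", "geo": {"lat": 47.6}}}')
print(flatten(raw))
# {'id': 7, 'address.city': 'Seattle', 'address.geo.lat': 47.6}
```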

Topic 3: Performance Optimization

  1. What is the role of AWS Glue’s Spark UI?
    a) Debugging ETL jobs
    b) Monitoring storage usage
    c) Managing IAM permissions
    d) Viewing database schemas
  2. How can you optimize the performance of AWS Glue jobs?
    a) Increasing DataFrame size
    b) Using job bookmarks
    c) Avoiding schema evolution
    d) Increasing S3 replication
  3. What does the AWS Glue job bookmark feature do?
    a) Prevents reprocessing of already processed data
    b) Automatically saves job configurations
    c) Archives completed jobs
    d) Flags errors in ETL jobs
  4. What is the recommended partitioning format for large datasets in Glue?
    a) Parquet
    b) JSON
    c) CSV
    d) TXT
  5. Which AWS Glue feature can help reduce I/O operations for large datasets?
    a) Partitioning
    b) Relationalize
    c) Job bookmarks
    d) Automatic retries
  6. How does AWS Glue use Apache Spark for performance optimization?
    a) By caching all data in memory
    b) By parallelizing ETL tasks
    c) By compressing all files
    d) By creating redundant copies
  7. What is the default storage location for intermediate data in Glue jobs?
    a) Amazon RDS
    b) Amazon S3
    c) Local memory
    d) Amazon DynamoDB
  8. How do job retries improve performance in AWS Glue?
    a) By reducing data replication
    b) By ensuring fault tolerance
    c) By increasing the number of partitions
    d) By speeding up data transformation
  9. Which AWS Glue configuration helps adjust memory for Spark executors?
    a) MaxConcurrentRuns
    b) DPU allocation
    c) Job parameters
    d) Schema versioning
  10. What is the default DPU allocation for AWS Glue jobs?
    a) 1 DPU
    b) 2 DPUs
    c) 5 DPUs
    d) 10 DPUs
  11. Which partitioning strategy improves Glue query performance?
    a) Small, frequent partitions
    b) Large, consolidated partitions
    c) Balanced partition sizes
    d) Dynamic partitioning
  12. How can you improve Glue’s performance when handling JSON files?
    a) Flatten nested structures
    b) Increase the DPU limit
    c) Use Parquet instead
    d) Reduce the number of tasks
  13. What happens when Glue jobs exhaust their allocated DPUs?
    a) The job fails
    b) The job slows down significantly
    c) Additional DPUs are allocated automatically
    d) The job continues without processing all data
  14. Which AWS service helps monitor AWS Glue job performance?
    a) AWS CloudWatch
    b) Amazon S3
    c) AWS Config
    d) AWS Lambda
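The job-bookmark questions above hinge on one idea: remembering which input has already been processed so the next run skips it. The sketch below simulates that with a plain Python set; in real Glue, bookmark state is persisted by the service per job, and the `run_job` helper and S3 paths here are purely illustrative.

```python
# Minimal simulation of the job-bookmark idea: remember which input
# files have already been processed and skip them on the next run.
# Real Glue bookmarks persist this state in the Glue service, keyed
# by job name; here a plain set stands in for that state.

def run_job(input_files, bookmark):
    """Process only files not yet recorded in the bookmark."""
    new_files = [f for f in input_files if f not in bookmark]
    for f in new_files:
        bookmark.add(f)      # mark as processed for future runs
    return new_files         # the files this run actually handled

bookmark = set()
first_run = run_job(["s3://bucket/day=01/a.json",
                     "s3://bucket/day=01/b.json"], bookmark)
second_run = run_job(["s3://bucket/day=01/a.json",
                      "s3://bucket/day=01/b.json",
                      "s3://bucket/day=02/c.json"], bookmark)
print(first_run)   # both day=01 files processed
print(second_run)  # only the new day=02 file
```

Note the Hive-style `day=01` path segments: that partitioning layout is what lets Glue (and Athena) prune unneeded data, which is the point of the partitioning questions above.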

Answer Key

Topic 1: Dynamic Frame Transformations
  1. a) A Spark DataFrame wrapper with additional capabilities
  2. c) toDF()
  3. a) They handle schema changes gracefully
  4. a) filter()
  5. a) Using apply_mapping()
  6. a) Converts semi-structured data to normalized tables
  7. c) Identified but not modified
  8. b) Python

Topic 2: Handling Semi-Structured and Streaming Data
  1. b) JSON
  2. b) Using relationalize()
  3. c) simple_json
  4. a) Yes, through Spark Streaming
  5. b) Amazon Kinesis
  6. d) apache_log
  7. a) Through crawlers
  8. c) Schema Registry

Topic 3: Performance Optimization
  1. a) Debugging ETL jobs
  2. b) Using job bookmarks
  3. a) Prevents reprocessing of already processed data
  4. a) Parquet
  5. a) Partitioning
  6. b) By parallelizing ETL tasks
  7. b) Amazon S3
  8. b) By ensuring fault tolerance
  9. b) DPU allocation
  10. b) 2 DPUs
  11. c) Balanced partition sizes
  12. a) Flatten nested structures
  13. b) The job slows down significantly
  14. a) AWS CloudWatch

Use a blank sheet to note your answers, then tally them against the answer key and score yourself.
