Test your knowledge with these AWS Glue MCQ questions and answers for Chapter 5: Advanced ETL Techniques. Covering essential topics such as DynamicFrame transformations, handling semi-structured and streaming data, and performance optimization, these questions are designed to strengthen your understanding of AWS Glue and help you master ETL workflows and real-world applications.
Topic 1: Dynamic Frame Transformations
What is a DynamicFrame in AWS Glue?
a) A Spark DataFrame wrapper with additional capabilities
b) A database table
c) A JSON file representation
d) A NoSQL database

Which method is used to convert a DynamicFrame to a DataFrame?
a) to_pandas()
b) to_dataframe()
c) toDF()
d) convertFrame()

What is the key advantage of using DynamicFrames in AWS Glue?
a) They handle schema changes gracefully
b) Faster execution than DataFrames
c) Built-in data encryption
d) Automatic cost optimization

Which transformation operation is used to filter data in a DynamicFrame?
a) filter()
b) select_fields()
c) rename_field()
d) glue_filter()

How do you map a field in a DynamicFrame?
a) Using apply_mapping()
b) Using map_field()
c) Using rename_field()
d) Using update_mapping()

What does the relationalize() function do in AWS Glue?
a) Converts semi-structured data to normalized tables
b) Converts a DataFrame to a CSV file
c) Creates dynamic columns
d) Deletes unstructured data

How are null values handled in DynamicFrames?
a) Automatically dropped
b) Replaced with default values
c) Identified but not modified
d) Replaced with NaN

Which language is primarily used for writing AWS Glue ETL scripts?
a) SQL
b) Python
c) Java
d) Scala
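The mapping question above refers to Glue's apply_mapping(), which takes 4-tuples of (source_name, source_type, target_name, target_type). As a rough, plain-Python sketch of those semantics (the helper below is an illustrative stand-in, not the awsglue API itself):

```python
# Plain-Python sketch of what DynamicFrame.apply_mapping() does conceptually.
# `apply_mapping` here is a hypothetical stand-in, not the awsglue API.

CASTS = {"int": int, "string": str, "double": float}

def apply_mapping(records, mappings):
    """Rename and cast fields per Glue-style 4-tuples:
    (source_name, source_type, target_name, target_type)."""
    out = []
    for rec in records:
        new_rec = {}
        for src, _src_type, dst, dst_type in mappings:
            if src in rec:  # fields absent from a record are simply skipped
                new_rec[dst] = CASTS[dst_type](rec[src])
        out.append(new_rec)
    return out

# Records need not share one schema -- the DynamicFrame selling point.
rows = [{"id": "1", "name": "a"}, {"id": "2"}]
mapped = apply_mapping(rows, [("id", "string", "user_id", "int"),
                              ("name", "string", "user_name", "string")])
# mapped[0] == {"user_id": 1, "user_name": "a"}; mapped[1] == {"user_id": 2}
```

Note how the second record, which lacks a `name` field, passes through without error rather than forcing a rigid schema up front.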
Topic 2: Handling Semi-Structured and Streaming Data
Which format is commonly used for semi-structured data in AWS Glue?
a) CSV
b) JSON
c) TXT
d) XML

How does AWS Glue handle nested fields in semi-structured data?
a) By flattening them automatically
b) Using relationalize()
c) Storing them as-is
d) By converting to CSV

What is AWS Glue’s default classification for JSON files?
a) relational_json
b) spark_json
c) simple_json
d) semi_structured

Can AWS Glue process streaming data directly?
a) Yes, through Spark Streaming
b) No, it only handles batch data
c) Only if integrated with Amazon Kinesis
d) Only if stored in S3

What AWS service is commonly used to stream data to AWS Glue?
a) Amazon S3
b) Amazon Kinesis
c) Amazon DynamoDB
d) AWS Lambda

Which classification does Glue use for semi-structured log data?
a) XML_log
b) JSON_log
c) cloudwatch_log
d) apache_log

How can AWS Glue infer the schema of semi-structured data?
a) Through crawlers
b) Using a predefined schema
c) By executing SQL queries
d) By manual input

What feature in AWS Glue helps handle schema evolution in streaming data?
a) DynamicFrame
b) DataFrame
c) Schema Registry
d) Glue Catalog
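The nested-field questions above both point at relationalize(), which turns a nested record into a flat parent row plus child rows linked back by a key. A plain-Python sketch of that idea (the function below is illustrative, not the awsglue API):

```python
# Plain-Python sketch of what Glue's relationalize() does conceptually:
# a nested array becomes its own child table, keyed back to the parent row.
# The function name and signature are illustrative, not the awsglue API.

def relationalize(records, array_field, key="id"):
    parents, children = [], []
    for rec in records:
        # Parent row keeps every scalar field, drops the nested array.
        parents.append({k: v for k, v in rec.items() if k != array_field})
        for item in rec.get(array_field, []):
            child = {f"{array_field}.{k}": v for k, v in item.items()}
            child[f"parent_{key}"] = rec[key]  # foreign key to the parent row
            children.append(child)
    return parents, children

orders = [{"id": 1, "customer": "a",
           "items": [{"sku": "x", "qty": 2}, {"sku": "y", "qty": 1}]}]
parents, items = relationalize(orders, "items")
# parents == [{"id": 1, "customer": "a"}]
# items[0] == {"items.sku": "x", "items.qty": 2, "parent_id": 1}
```

The two flat tables can then be written to relational targets or joined back together, which is exactly why relationalize() is the answer for normalizing semi-structured data.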
Topic 3: Performance Optimization
What is the role of AWS Glue’s Spark UI?
a) Debugging ETL jobs
b) Monitoring storage usage
c) Managing IAM permissions
d) Viewing database schemas

How can you optimize the performance of AWS Glue jobs?
a) Increasing DataFrame size
b) Using job bookmarks
c) Avoiding schema evolution
d) Increasing S3 replication

What does the AWS Glue job bookmark feature do?
a) Prevents reprocessing of already processed data
b) Automatically saves job configurations
c) Archives completed jobs
d) Flags errors in ETL jobs

What is the recommended partitioning format for large datasets in Glue?
a) Parquet
b) JSON
c) CSV
d) TXT

Which AWS Glue feature can help reduce I/O operations for large datasets?
a) Partitioning
b) Relationalize
c) Job bookmarks
d) Automatic retries

How does AWS Glue use Apache Spark for performance optimization?
a) By caching all data in memory
b) By parallelizing ETL tasks
c) By compressing all files
d) By creating redundant copies

What is the default storage location for intermediate data in Glue jobs?
a) Amazon RDS
b) Amazon S3
c) Local memory
d) Amazon DynamoDB
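The job-bookmark questions above hinge on one idea: Glue persists per-job state about what has already been processed, so a rerun picks up only new input. A plain-Python sketch of that behavior, with a dict standing in for the persisted bookmark state (illustrative, not the Glue API):

```python
# Plain-Python sketch of the job-bookmark idea: remember what was already
# processed so a rerun skips it. Glue persists this state per job; here a
# dict stands in for that persisted state (illustrative, not the Glue API).

def run_job(bookmark, available_files):
    done = set(bookmark.get("processed", []))
    to_process = sorted(f for f in available_files if f not in done)
    # ... transform each new file here ...
    bookmark["processed"] = sorted(done | set(to_process))  # advance bookmark
    return to_process

bookmark = {}
first = run_job(bookmark, ["s3://bucket/day=1.json", "s3://bucket/day=2.json"])
second = run_job(bookmark, ["s3://bucket/day=1.json",
                            "s3://bucket/day=2.json",
                            "s3://bucket/day=3.json"])
# first run processes both files; second run processes only day=3.json
```

This is why bookmarks count as a performance feature: skipping already-processed input cuts both compute and reads on reruns.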
How do job retries improve performance in AWS Glue?
a) By reducing data replication
b) By ensuring fault tolerance
c) By increasing the number of partitions
d) By speeding up data transformation

Which AWS Glue configuration helps adjust memory for Spark executors?
a) MaxConcurrentRuns
b) DPU allocation
c) Job parameters
d) Schema versioning

What is the default DPU allocation for AWS Glue jobs?
a) 1 DPU
b) 2 DPUs
c) 5 DPUs
d) 10 DPUs

Which partitioning strategy improves Glue query performance?
a) Small, frequent partitions
b) Large, consolidated partitions
c) Balanced partition sizes
d) Dynamic partitioning

How can you improve Glue’s performance when handling JSON files?
a) Flatten nested structures
b) Increase the DPU limit
c) Use Parquet instead
d) Reduce the number of tasks

What happens when Glue jobs exhaust their allocated DPUs?
a) The job fails
b) The job slows down significantly
c) Additional DPUs are allocated automatically
d) The job continues without processing all data

Which AWS service helps monitor AWS Glue job performance?
a) AWS CloudWatch
b) Amazon S3
c) AWS Config
d) AWS Lambda
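The partitioning questions above rest on the Hive-style layout that Glue and Spark use for partitioned data: partition values are encoded in the S3 path (`year=2024/month=1/…`), so a query filtered on those keys never has to open the other partitions' files. A plain-Python sketch of path building and partition pruning (helper names here are illustrative):

```python
# Plain-Python sketch of Hive-style partitioning, the layout Glue and Spark
# use for partitioned Parquet data: partition values live in the path, so a
# filter on year=2024 never touches other years' files. Helper names are
# illustrative, not a Glue or Spark API.

def partition_path(prefix, **keys):
    """Build a Hive-style partition path like prefix/year=2024/month=1/..."""
    parts = "/".join(f"{k}={v}" for k, v in keys.items())
    return f"{prefix}/{parts}/part-0000.parquet"

def prune(paths, **wanted):
    """Keep only paths whose partition values match the filter."""
    needles = [f"{k}={v}" for k, v in wanted.items()]
    return [p for p in paths if all(n in p for n in needles)]

paths = [partition_path("s3://lake/events", year=y, month=m)
         for y in (2023, 2024) for m in (1, 2)]
# Pruning to year=2024 leaves 2 of the 4 files to read.
assert len(prune(paths, year=2024)) == 2
```

Combined with a columnar format like Parquet, this pruning is the main way partitioning reduces I/O for large datasets.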
Answer Key

1. a) A Spark DataFrame wrapper with additional capabilities
2. c) toDF()
3. a) They handle schema changes gracefully
4. a) filter()
5. a) Using apply_mapping()
6. a) Converts semi-structured data to normalized tables
7. c) Identified but not modified
8. b) Python
9. b) JSON
10. b) Using relationalize()
11. c) simple_json
12. a) Yes, through Spark Streaming
13. b) Amazon Kinesis
14. d) apache_log
15. a) Through crawlers
16. c) Schema Registry
17. a) Debugging ETL jobs
18. b) Using job bookmarks
19. a) Prevents reprocessing of already processed data
20. a) Parquet
21. a) Partitioning
22. b) By parallelizing ETL tasks
23. b) Amazon S3
24. b) By ensuring fault tolerance
25. b) DPU allocation
26. d) 10 DPUs
27. c) Balanced partition sizes
28. c) Use Parquet instead
29. b) The job slows down significantly
30. a) AWS CloudWatch