MCQs on HDFS Data Migration | Hadoop HDFS

Master the essentials of HDFS Data Migration, including data ingestion using Apache Flume, integration with Apache Sqoop, inter-cluster data management, and HDFS backup and restore strategies. Test your knowledge now!


Data Ingestion to HDFS: Using Apache Flume

  1. What is Apache Flume primarily used for in HDFS data ingestion?
    • A) Querying data
    • B) Streaming large volumes of log data
    • C) Backing up files
    • D) Managing clusters
  2. Which of the following is a core component of Apache Flume?
    • A) Source, Sink, and Channel
    • B) Producer and Consumer
    • C) Nodes and Mappers
    • D) Reducer and Mapper
  3. In Flume, a channel acts as:
    • A) A storage buffer between the source and sink
    • B) The origin of the data stream
    • C) The endpoint for writing data
    • D) A load balancer
  4. Which Flume component is responsible for writing data to HDFS?
    • A) Source
    • B) Sink
    • C) Channel
    • D) Interceptor
  5. What is the purpose of an interceptor in Apache Flume?
    • A) To filter or modify events
    • B) To split the data stream
    • C) To manage channel capacity
    • D) To optimize cluster performance
  6. Which type of Flume source is used to collect log data from application servers?
    • A) HTTP source
    • B) Exec source
    • C) Syslog source
    • D) Avro source
  7. Which of the following is an advantage of using Flume for data ingestion?
    • A) Real-time querying capabilities
    • B) Scalability for large data streams
    • C) Data compression during transfer
    • D) Built-in SQL support
  8. How does Flume achieve fault tolerance during data ingestion?
    • A) By replicating events across multiple channels
    • B) By re-sending failed events
    • C) By encrypting data streams
    • D) By using checkpoint files
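The source–channel–sink pipeline covered in the questions above can be sketched as a minimal Flume agent configuration. This is an illustrative example, not a production setup: the agent name `a1`, log path, and HDFS URL are placeholders.

```properties
# Illustrative Flume agent: tail an application log and land events in HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Exec source runs a shell command and emits each output line as an event.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# File channel durably buffers events between source and sink
# (its checkpoint/data files are what survive an agent restart).
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# HDFS sink writes the buffered events into date-partitioned HDFS paths.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

An agent like this is typically started with `flume-ng agent --conf conf --conf-file agent.properties --name a1`.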

Integrating HDFS with Apache Sqoop for Data Transfer

  1. What is Apache Sqoop used for in Hadoop?
    • A) Importing/exporting structured data between HDFS and relational databases
    • B) Running MapReduce jobs
    • C) Streaming unstructured data into HDFS
    • D) Performing real-time analytics
  2. Which Sqoop command is used to transfer data from a database to HDFS?
    • A) sqoop-import
    • B) sqoop-export
    • C) sqoop-move
    • D) sqoop-transfer
  3. In Sqoop, which parameter specifies the target directory in HDFS?
    • A) --target-dir
    • B) --hdfs-dir
    • C) --warehouse-dir
    • D) --output-path
  4. What is the purpose of the --split-by option in Sqoop?
    • A) Splitting files into smaller blocks in HDFS
    • B) Defining the column for parallel data transfer
    • C) Splitting large jobs into smaller tasks
    • D) Specifying the delimiter for file output
  5. Which file format is not supported by Sqoop during data import?
    • A) Avro
    • B) Parquet
    • C) ORC
    • D) XML
  6. How can you optimize the performance of data import using Sqoop?
    • A) By increasing the number of mappers
    • B) By using fewer reducers
    • C) By compressing source files
    • D) By enabling journaling
  7. Which of the following Sqoop commands exports data from HDFS to a relational database?
    • A) sqoop-import-all-tables
    • B) sqoop-export
    • C) sqoop-data-transfer
    • D) sqoop-file-export
  8. What is the role of the --connect parameter in Sqoop commands?
    • A) It specifies the database URL
    • B) It sets the HDFS location
    • C) It defines the mapper class
    • D) It provides the job configuration file
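The Sqoop options above fit together as shown in the following command sketches. The JDBC URL, credentials, table names, and HDFS paths are illustrative placeholders; the flags themselves (`--connect`, `--target-dir`, `--split-by`, `--num-mappers`, `--export-dir`) are standard Sqoop arguments.

```shell
# Import a relational table into HDFS, parallelized across 8 mappers
# split on the order_id column.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl --password-file /user/etl/.db-password \
  --table orders \
  --target-dir /data/sales/orders \
  --split-by order_id \
  --num-mappers 8

# Export HDFS data back into a relational table.
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl --password-file /user/etl/.db-password \
  --table orders_summary \
  --export-dir /data/sales/orders_summary
```

Increasing `--num-mappers` raises import parallelism, but only pays off when `--split-by` names a column whose values distribute evenly across mappers.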

Managing Data Migration between Hadoop Clusters

  1. Which command is typically used to migrate data between Hadoop clusters?
    • A) distcp
    • B) hdfs-transfer
    • C) copyFromLocal
    • D) fs-move
  2. What does the distcp tool use to ensure efficient data migration?
    • A) MapReduce framework
    • B) Spark Streaming
    • C) HDFS replication
    • D) Custom shell scripts
  3. How can distcp handle large file transfers reliably?
    • A) By using retry mechanisms for failed tasks
    • B) By encrypting data during transfer
    • C) By compressing all data streams
    • D) By running in single-threaded mode
  4. What is a key prerequisite for using distcp between clusters?
    • A) Both clusters must use the same HDFS version
    • B) Both clusters must share the same namenode
    • C) Both clusters must have Hive installed
    • D) Both clusters must be in the same physical location
  5. Which distcp option, used with -update or -overwrite, deletes files from the target that are absent from the source?
    • A) -move
    • B) -delete
    • C) -rm
    • D) -clean-up
  6. What is the purpose of the -update option in the distcp command?
    • A) To synchronize only changed or new files
    • B) To overwrite all files in the target cluster
    • C) To delete outdated files in the source cluster
    • D) To compress transferred files
  7. Which of the following is a limitation of using distcp for data migration?
    • A) It cannot handle compressed files
    • B) It requires manual task retries
    • C) It does not support encryption
    • D) It is slower for small files
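The distcp usage covered above can be sketched as follows; the namenode addresses and paths are illustrative placeholders, while the flags (`-update`, `-delete`, `-m`, `-p`) are standard distcp options.

```shell
# Copy a directory tree from one cluster to another; distcp launches
# a MapReduce job whose map tasks perform the copies in parallel.
hadoop distcp hdfs://nn1:8020/data/events hdfs://nn2:8020/data/events

# Incremental sync: -update copies only new or changed files, and
# -delete additionally removes target files that no longer exist in the source.
hadoop distcp -update -delete \
  hdfs://nn1:8020/data/events hdfs://nn2:8020/data/events

# Cap the job at 20 map tasks and preserve block size and permissions.
hadoop distcp -m 20 -pbp hdfs://nn1:8020/data/events hdfs://nn2:8020/data/events
```

Because each file is handled by a map task, distcp amortizes well over large files but pays per-file overhead on many small ones, which is why it is comparatively slow for small-file workloads.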

Backup and Restore in HDFS

  1. Which of the following tools can be used to back up HDFS data?
    • A) Snapshots
    • B) distcp
    • C) Hive
    • D) Pig
  2. What is the primary purpose of HDFS snapshots?
    • A) To replicate data between clusters
    • B) To create a read-only point-in-time view of data
    • C) To compress HDFS files
    • D) To track file access history
  3. HDFS snapshots are stored:
    • A) In the same directory as the original data
    • B) In a separate snapshot directory
    • C) In a database table
    • D) On a backup server
  4. How can you restore data from an HDFS snapshot?
    • A) By copying the snapshot data to the original directory
    • B) By running the snapshot-restore command
    • C) By enabling versioning
    • D) By using a MapReduce job
  5. What is a key limitation of HDFS snapshots?
    • A) They require downtime during creation
    • B) They increase storage usage significantly
    • C) They do not preserve file permissions
    • D) They cannot be deleted once created
  6. Which command lists all snapshots for a given HDFS directory?
    • A) hdfs lsSnapshot
    • B) hdfs snapshot-view
    • C) hdfs lsSnap
    • D) hdfs ls
  7. Which strategy is recommended for ensuring data recovery in case of HDFS failure?
    • A) Periodic snapshots and offsite replication
    • B) Compressing data periodically
    • C) Upgrading the cluster hardware regularly
    • D) Using only encrypted storage
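The snapshot lifecycle described above can be sketched with the standard HDFS commands below; the directory and snapshot names are illustrative placeholders.

```shell
# An administrator must first mark a directory as snapshottable.
hdfs dfsadmin -allowSnapshot /data/warehouse

# Take a named, read-only point-in-time snapshot.
hdfs dfs -createSnapshot /data/warehouse backup-2024-01-01

# Snapshots appear under the hidden .snapshot subdirectory of the
# snapshottable directory, not on a separate backup server.
hdfs dfs -ls /data/warehouse/.snapshot

# Restore by copying files out of the snapshot back into place.
hdfs dfs -cp /data/warehouse/.snapshot/backup-2024-01-01/part-00000 /data/warehouse/

# Remove a snapshot once it is no longer needed.
hdfs dfs -deleteSnapshot /data/warehouse backup-2024-01-01
```

Snapshots record only metadata and changed blocks, so taking one is cheap; combining periodic snapshots with replication to a second cluster (for example via distcp) covers both accidental deletion and whole-cluster failure.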

Answer Key

(Question numbers run continuously across the four sections above.)

  1. B) Streaming large volumes of log data
  2. A) Source, Sink, and Channel
  3. A) A storage buffer between the source and sink
  4. B) Sink
  5. A) To filter or modify events
  6. C) Syslog source
  7. B) Scalability for large data streams
  8. A) By replicating events across multiple channels
  9. A) Importing/exporting structured data between HDFS and relational databases
  10. A) sqoop-import
  11. A) --target-dir
  12. B) Defining the column for parallel data transfer
  13. D) XML
  14. A) By increasing the number of mappers
  15. B) sqoop-export
  16. A) It specifies the database URL
  17. A) distcp
  18. A) MapReduce framework
  19. A) By using retry mechanisms for failed tasks
  20. A) Both clusters must use the same HDFS version
  21. B) -delete
  22. A) To synchronize only changed or new files
  23. D) It is slower for small files
  24. A) Snapshots
  25. B) To create a read-only point-in-time view of data
  26. B) In a separate snapshot directory
  27. A) By copying the snapshot data to the original directory
  28. B) They increase storage usage significantly
  29. A) hdfs lsSnapshot
  30. A) Periodic snapshots and offsite replication

Note your answers on a blank sheet as you go, then tally them against the answer key and score yourself.
