Data Cleaning and Preprocessing in R
Master the essential techniques of data cleaning and preprocessing in R. Learn how to handle missing and duplicate data, transform data with dplyr, and summarize data efficiently.
Handling Missing and Duplicate Data
1. Which function is used to identify missing values in a data frame in R? a) missing() b) is.na() c) na.test() d) is.null()
2. What does the na.omit() function do in R? a) Replaces NA values with zeros b) Removes rows with missing values c) Replaces missing values with the mean d) Identifies missing values
3. How can you replace missing values with the mean of the column in R? a) replace.na() b) replace() c) mutate() d) fill()
4. Which of the following functions is used to remove duplicate rows in a data frame in R? a) distinct() b) remove_duplicates() c) unique() d) drop_duplicates()
5. What does the function complete.cases() return? a) Rows without any missing values b) Rows with missing values c) A summary of missing values d) A boolean value for each row
6. How can you count the number of missing values in a column of a data frame in R? a) sum(is.na(column)) b) count.na(column) c) is.na(column).sum() d) missing.count(column)
7. Which of the following is NOT a method for handling missing data in R? a) Removing rows with missing values b) Replacing missing values with zeros c) Replacing missing values with column mean d) Ignoring missing data
8. To identify duplicate rows based on specific columns, which function can be used? a) duplicated() b) remove_duplicates() c) distinct() d) unique()
9. Which R function is used to replace missing values in a data frame with a specified value? a) fill() b) replace() c) replace_na() d) substitute()
10. What is the purpose of the drop_na() function in R? a) Drop rows that contain missing values b) Drop rows with specific missing values c) Drop columns with missing values d) None of the above
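For readers who want to try the functions named in this section, the following is a minimal sketch rather than a definitive recipe. The data frame `df` and its columns (`id`, `age`, `score`) are hypothetical and exist only for illustration; the sketch assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)  # distinct()
library(tidyr)  # drop_na(), replace_na()

# Hypothetical example data: rows 2 and 3 are identical, and some values are missing
df <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  age   = c(25, NA, NA, 40, 31),
  score = c(80, 90, 90, NA, 70)
)

# Count missing values in one column
n_missing_age <- sum(is.na(df$age))

# complete.cases() returns one logical value per row (TRUE = no missing values)
complete_flags <- complete.cases(df)

# Two ways to drop rows that contain missing values
no_na_base <- na.omit(df)   # base R
no_na_tidy <- drop_na(df)   # tidyr

# Replace missing ages with the column mean (base R replace())
age_filled <- replace(df$age, is.na(df$age), mean(df$age, na.rm = TRUE))

# Replace missing scores with a fixed value (tidyr replace_na())
zero_filled <- replace_na(df, list(score = 0))

# Flag duplicate rows, then remove them
dup_flags <- duplicated(df)   # TRUE marks a repeat of an earlier row
deduped   <- distinct(df)     # dplyr; unique(df) is the base R counterpart

# Flag duplicates based on specific columns only
dup_by_id <- duplicated(df[, c("id", "age")])
```

Note that `complete.cases()` and `duplicated()` only flag rows; subsetting with the result, for example `df[complete.cases(df), ]`, is what actually removes them.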
Data Transformation with dplyr
11. Which R package provides functions like mutate(), filter(), and select()? a) tidyr b) ggplot2 c) dplyr d) base
12. What does the mutate() function do in R? a) Adds new columns to a data frame b) Filters rows based on conditions c) Changes the structure of a data frame d) Selects specific columns
13. How do you select specific columns from a data frame using dplyr? a) filter() b) select() c) slice() d) mutate()
14. Which function in dplyr is used to filter rows based on conditions? a) select() b) mutate() c) filter() d) arrange()
15. To arrange the rows of a data frame in ascending order of a column, which function is used? a) sort() b) arrange() c) order() d) order_by()
16. How do you apply a function to each column of a data frame using dplyr? a) apply() b) summarize() c) mutate_all() d) map()
17. What is the purpose of the group_by() function in dplyr? a) To group data by specific columns for aggregation b) To rearrange the data c) To filter data based on conditions d) To transform data
18. Which function in dplyr is used to summarize data by a group? a) summarize() b) mutate() c) group_by() d) filter()
19. How do you create a new column in a data frame based on an existing one using dplyr? a) mutate() b) create() c) transform() d) add_column()
20. To calculate the mean of a column after grouping by another column, you would use: a) summarize(mean(column)) b) group_by(other_column) %>% summarize(mean(column)) c) mutate(mean(column)) d) mean_by(column)
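The dplyr verbs named in this section can be chained with the pipe. The sketch below is only an illustration under stated assumptions: the `sales` data frame and its columns (`region`, `units`, `price`) are hypothetical, and dplyr 1.0.0 or later is assumed, where `across()` is the current way to apply a function to several columns (the scoped verb `mutate_all()` still works but is superseded).

```r
library(dplyr)

# Hypothetical example data
sales <- data.frame(
  region = c("North", "South", "North", "East", "South"),
  units  = c(10, 4, 7, 12, 9),
  price  = c(2.5, 3.0, 2.5, 4.0, 3.0)
)

result <- sales %>%
  mutate(revenue = units * price) %>%   # add a column derived from existing ones
  filter(units > 5) %>%                 # keep rows that meet a condition
  select(region, units, revenue) %>%    # keep only the named columns
  arrange(revenue)                      # sort ascending by revenue

# Group by one column, then summarize another column per group
by_region <- sales %>%
  group_by(region) %>%
  summarize(mean_units = mean(units), total_units = sum(units))

# Apply the same function to several columns at once
doubled <- sales %>%
  mutate(across(c(units, price), function(x) x * 2))
```

Wrapping a column in `desc()`, as in `arrange(desc(revenue))`, sorts in descending order instead.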
Data Summarization Techniques
21. What does the summary() function provide in R? a) A summary of missing values b) A quick overview of statistics like mean, median, etc. c) The structure of a data frame d) A detailed visualization
22. Which R function is used to compute the mean of a numeric column? a) sum() b) mean() c) avg() d) calculate()
23. Which function is used to calculate the standard deviation of a numeric column in R? a) sd() b) std() c) stdev() d) deviation()
24. Which function would you use to get a quick count of non-missing values in a column? a) count() b) length() c) n() d) sum()
25. What does the table() function in R do? a) Displays summary statistics b) Creates a frequency table c) Transforms data d) Creates visualizations
26. Which function can be used to calculate the median of a numeric column in R? a) median() b) mean() c) average() d) calc.median()
27. How would you group data by one or more variables and then calculate the sum of each group in R? a) group_by() %>% summarise(sum()) b) summarise() %>% group_by() c) sum_group_by() d) group_sum()
28. To count the number of unique values in a column, which function would you use? a) unique() b) distinct() c) count() d) length()
29. Which of the following is used to calculate the interquartile range (IQR) of a numeric column in R? a) iqr() b) IQR() c) interquartile() d) range()
30. To get the five-number summary (minimum, first quartile, median, third quartile, maximum) of a numeric column, which function is used? a) summary() b) quantile() c) five_number_summary() d) stat_summary()
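The base R summary functions named in this section can be compared side by side. This is a minimal sketch using only base R; the vectors `x` and `grp` are hypothetical values chosen for illustration. Most of these functions return NA when the input contains missing values unless `na.rm = TRUE` is passed.

```r
# Hypothetical numeric column with one missing value
x <- c(4, 8, 15, 16, 23, 42, NA)

avg    <- mean(x, na.rm = TRUE)     # arithmetic mean
med    <- median(x, na.rm = TRUE)   # median
spread <- sd(x, na.rm = TRUE)       # standard deviation
iqr    <- IQR(x, na.rm = TRUE)      # interquartile range (note the capitals)
n_obs  <- sum(!is.na(x))            # count of non-missing values

q   <- quantile(x, na.rm = TRUE)    # 0%, 25%, 50%, 75%, 100% by default
f   <- fivenum(x)                   # Tukey's five-number summary (drops NA by default)
ovw <- summary(x)                   # min, quartiles, median, mean, max, and NA count

# Hypothetical categorical column
grp <- c("a", "b", "a", "c", "b", "a")

freq     <- table(grp)              # frequency of each distinct value
n_unique <- length(unique(grp))     # number of distinct values
```

On a numeric vector, `summary()` also reports the mean, so it prints six statistics rather than five; `quantile()` with its default probabilities and `fivenum()` each return exactly five values, though they can differ slightly because they compute quartiles differently.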
Answer Key
QNo   Answer (option with text)
1     b) is.na()
2     b) Removes rows with missing values
3     b) replace()
4     a) distinct()
5     d) A boolean value for each row (TRUE where the row has no missing values)
6     a) sum(is.na(column))
7     d) Ignoring missing data
8     a) duplicated()
9     c) replace_na()
10    a) Drop rows that contain missing values
11    c) dplyr
12    a) Adds new columns to a data frame
13    b) select()
14    c) filter()
15    b) arrange()
16    c) mutate_all()
17    a) To group data by specific columns for aggregation
18    a) summarize()
19    a) mutate()
20    b) group_by(other_column) %>% summarize(mean(column))
21    b) A quick overview of statistics like mean, median, etc.