STAT 19000: Project 10 — Fall 2020

Motivation: Functions are powerful. They are building blocks to more complex programs and behavior. In fact, there is an entire programming paradigm based on functions called functional programming. In this project, we will learn to apply functions to entire vectors of data using sapply.

Context: We’ve just taken some time to learn about and create functions. One of the more common "next steps" after creating a function is to use it on a series of data, like a vector. sapply is one of the best ways to do this in R.

Scope: r, sapply, functions

Learning objectives
  • Read and write basic (csv) data.

  • Explain and demonstrate: positional, named, and logical indexing.

  • Utilize apply functions in order to solve a data-driven problem.

  • Gain proficiency using split, merge, and subset.

Dataset

The following questions will use the dataset found in Scholar:

/class/datamine/data/okcupid/filtered

Questions

Please make sure to double check that the your submission does indeed contain the files you think it does. You can do this by downloading your submission from Gradescope after uploading. If you can see all of your files and they open up properly on your computer, you should be good to go.

Please make sure to look at your knit PDF before submitting. PDFs should be relatively short and not contain huge amounts of printed data. Remember you can use functions like head to print a sample of the data or output. Extremely large PDFs will be subject to lose points.

Question 1

Load up the the following datasets into data.frames named users and questions, respectively: /class/datamine/data/okcupid/filtered/users.csv, /class/datamine/data/okcupid/filtered/questions.csv. This is data from users on OkCupid, an online dating app. In your own words, explain what each file contains and how they are related — its always a good idea to poke around the data to get a better understanding of how things are structured!

Be careful, just because a file ends in .csv, does not mean it is comma-separated. You can change what separator read.csv uses with the sep argument. You can use the readLines function on a file (say, with n=10, for instance), to see the first lines of a file, and determine the character to use with the sep argument.

Items to submit
  • R code used to solve the problem.

  • 1-2 sentences describing what each file contains and how they are related.

Question 2

grep is an incredibly powerful tool available to us in R. We will learn more about grep in the future, but for now, know that a simple application of grep is to find a word in a string. In R, grep is vectorized and can be applied to an entire vector of strings. Use grep to find a question that references "google". What is the question?

If at first you don’t succeed, run ?grep and check out the ignore.case argument.

To prepare for Question 3, look at the entire row of the questions data frame that has the question about google. The first entry on this row tells you the question that you need, in the users data frame, while working on Question 3.

Items to submit
  • R code used to solve the problem.

  • The text of the question that references Google.

Question 3

In (2) we found a pretty interesting question. What is the percentage of users that Google someone before the first date? Does the proportion change by gender (as defined by gender2)? How about by gender_orientation?

The two videos posted in Question 2 might help.

If you look at the column of users corresponding to the question identified in (2), you will see that this column of users has two possible answers, namely: "No. Why spoil the mystery?" and "Yes. Knowledge is power!".

Use the tapply function with three inputs:

the correct column of users,

breaking up the data according to gender2 or according to gender_orientation,

and use this as your function in the tapply:

function(x) {prop.table(table(x, useNA="always"))}

Items to submit
  • R code used to solve this problem.

  • The results of running the code.

  • Written answers to the questions.

Question 4

In Project 8, we created a function called count_words. Use this function and sapply to create a vector which contains the number of words in each row of the column text from the questions dataframe. Call the new vector question_length, and add it as a column to the questions dataframe.

count_words <- function(my_text) {
    my_split_text <- unlist(strsplit(my_text, " "))

    return(length(my_split_text[my_split_text!=""]))
}
Items to submit
  • R code used to solve this problem.

  • The result of str(questions) (this shows how your questions data frame looks, after adding the new column called question_length).

Question 5

Consider this function called number_of_options that accepts a data frame (for instance, questions)…​

number_of_options <- function(myDF) {
   table(apply(as.matrix(myDF[ ,3:6]), 1, function(x) {sum(!(x==""))}))
}

…​and counts the number of questions that have each possible number of responses. For instance, if we calculate number_of_options(questions) we get:

` 0 2 3 4 590 936 519 746 `

which means that: 590 questions have 0 possible responses; 936 questions have 2 possible responses; 519 questions have 3 possible responses; and 746 questions have 4 possible responses.

Now use the split function to break the data frame questions into 7 smaller data frames, according to the value in questions$Keywords. Then use the sapply function to determine, for each possible value of questions$Keywords, the analogous breakdown of questions with different numbers of responses, as we did above.

You can write:

mylist <- split(questions, questions$Keywords)
sapply(mylist, number_of_options)

The way sapply works is the the first argument is by default the first argument to your function, the second argument is the function you want applied, and after that you can specify arguments by name. For example:

test1 <- c(1, 2, 3, 4, NA, 5)
test2 <- c(9, 8, 6, 5, 4, NA)
mylist <- list(first=test1, second=test2)
# for a single vector in the list
mean(mylist$first, na.rm=T)
# what if we want to do this for each vector in the list?
# how do we remove na's?
sapply(mylist, mean)
# we can specify the arguments that are for the mean function
# by naming them after the first two arguments, like this
sapply(mylist, mean, na.rm=T)
# in the code shown above, na.rm=T is passed to the mean function
# just like if you run the following
mean(mylist$first, na.rm=T)
mean(mylist$second, na.rm=T)
# you can include as many arguments to mean as you normally would
# and in any order. just make sure to name the arguments
sapply(mylist, mean, na.rm=T, trim=0.5)
# or sapply(mylist, mean, trim=0.5, na.rm=T)
# which is similar to
mean(mylist$first, na.rm=T, trim=0.5)
mean(mylist$second, na.rm=T, trim=0.5)
Items to submit
  • R code used to solve this problem.

  • The results of the running the code.

Question 6

Lots of questions are asked in this okcupid dataset. Explore the dataset, and either calculate an interesting statistic/result using sapply, or generate a graphic (with good x-axis and/or y-axis labels, main labels, legends, etc.), or both! Write 1-2 sentences about your analysis and/or graphic, and explain what you thought you’d find, and what you actually discovered.

Items to submit
  • R code used to solve this problem.

  • The results from running your code.

  • 1-2 sentences about your analysis and/or graphic, and explain what you thought you’d find, and what you actually discovered.

OPTIONAL QUESTION

Does it appear that there is an association between the length of the question and whether or not users answered the question? Assume NA means "unanswered". First create a function called percent_answered that, given a vector, returns the percentage of values that are not NA. Use percent_answered and sapply to calculate the percentage of users who answer each question. Plot this result, against the length of the questions.

length_of_questions ← questions$question_length[grep("^q", questions$X)]

grep("^q", questions$X) returns the column index of every column that starts with "q". Use the same trick we used in the previous hint, to subset our users data.frame before using sapply to apply percent_answered.

Items to submit
  • R code used to solve this problem.

  • The plot.

  • Whether or not you think there may or may not be an association between question length and whether or not the question is answered.