Chapter 3 Data transformation

In our data transformation script, we perform the following steps:

  1. Since all data are .csv files, we load them using the read.csv() function in R.

  2. We drop the variables we do not need from the bike-sharing data: Duration, Start Station, End Station, and Bike number.

  3. We use the variable Start Date as the time at which each trip occurred, and split it into four date and time variables: Year, Month, Day, and Hour.

  4. For the temperature data, we convert the variable DATE to the standard YYYY-MM-DD format.

  5. We merge the two data sets on the date variable using the left_join() function from the dplyr package.

  6. We add the variable workdays, which is TRUE if the day is neither a holiday nor a weekend day, and FALSE otherwise.

  7. Because a data frame with one row per trip is very large, we coarsen the granularity to one hour by aggregating the trips into a total count for each hour of each day.
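
The steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the actual script: the two small data frames stand in for the files loaded with read.csv(), the dotted column names (Start.date, Bike.number, and so on) assume read.csv()'s default name conversion, the raw DATE format (m/d/Y) and the holiday list are hypothetical, and grouping by Member.Type is inferred from the variable list of the derived data set.

```r
library(dplyr)

# Toy stand-ins for the data loaded with read.csv() (step 1).
# Column names and values are assumptions for illustration only.
trips <- data.frame(
  Duration    = c(300, 620, 450, 900, 1200),
  Start.date  = c("2019-01-01 08:05:00", "2019-01-02 08:05:00",
                  "2019-01-02 08:40:00", "2019-01-02 08:50:00",
                  "2019-01-05 14:10:00"),
  Bike.number = c("W001", "W002", "W003", "W004", "W005"),
  Member.Type = c("registered", "registered", "registered",
                  "casual", "casual"),
  stringsAsFactors = FALSE
)
weather <- data.frame(
  DATE = c("1/1/2019", "1/2/2019", "1/5/2019"),  # assumed raw format
  PRCP = c(0.00, 0.12, 0.00),
  TAVG = c(45, 38, 41),
  stringsAsFactors = FALSE
)
holidays <- as.Date("2019-01-01")  # hypothetical holiday list

# Step 2: drop the variables that are not needed (any_of() ignores
# names that are absent from the toy data).
trips <- trips %>%
  select(-any_of(c("Duration", "Start.station", "End.station", "Bike.number")))

# Step 3: split the start time into Year, Month, Day, and Hour.
trips <- trips %>%
  mutate(
    start = as.POSIXct(Start.date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    date  = as.Date(format(start, "%Y-%m-%d")),
    Year  = as.integer(format(start, "%Y")),
    Month = as.integer(format(start, "%m")),
    Day   = as.integer(format(start, "%d")),
    Hour  = as.integer(format(start, "%H"))
  ) %>%
  select(-start, -Start.date)

# Step 4: convert DATE to a Date, which prints as YYYY-MM-DD.
weather <- weather %>%
  mutate(DATE = as.Date(DATE, format = "%m/%d/%Y"))

# Step 5: attach the weather variables to every trip.
merged <- left_join(trips, weather, by = c("date" = "DATE"))

# Step 6: flag workdays (wday: 0 = Sunday, 6 = Saturday).
merged <- merged %>%
  mutate(workdays = !(as.POSIXlt(date)$wday %in% c(0, 6)) &
                    !(date %in% holidays))

# Step 7: aggregate to one row per hour (and member type), counting trips.
hourly <- merged %>%
  group_by(Year, Month, Day, Hour, PRCP, TAVG, workdays, Member.Type) %>%
  summarise(num = n(), .groups = "drop")

print(hourly)
```

Using left_join() keeps every trip even when a date has no matching weather record; the weather columns are then NA for those rows rather than the rows being dropped.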

The derived data set is condensed to 16,889 rows with 13 variables:

  • Year - Year in which the trips occurred (2019)
  • Month - Month in which the trips occurred
  • Day - Day of the month on which the trips occurred
  • Hour - Hour of the day in which the trips occurred
  • PRCP - The amount of precipitation
  • SNOW - The amount of snow
  • SNWD - The depth of snow
  • TAVG - The average temperature in Fahrenheit
  • TMAX - The maximum temperature in Fahrenheit
  • TMIN - The minimum temperature in Fahrenheit
  • workdays - Whether the day is a workday
  • Member.Type - Whether the rider was a “registered” member or a “casual” rider
  • num - Number of bike trips