Chapter 3 Data transformation

In our data transformation script, we perform the following steps:

Since all data are .csv files, we load them using the read.csv() function in R.
We drop the unnecessary variables in the bike-sharing data: Duration, Start Station, End Station, and Bike number.
We use the variable Start Date as the time at which each trip occurred and split it into four date and time variables: Year, Month, Day, and Hour.
For the temperature data, we convert the variable DATE to the standard YYYY-MM-DD format.
We merge the two data sets on the date variable using the left_join() function from the dplyr package.
We add the variable workdays, which is true if the day is neither a holiday nor a weekend day and false otherwise.
Since the data frame is too large when every row represents a single trip, we change the granularity to hourly by aggregating the trips into a total count for each hour of each day.
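
The steps above can be sketched on toy data as follows. This is a minimal illustration, not the actual script: the toy rows, the column names (written as read.csv() would mangle them, e.g. Start.Date), the m/d/Y input format of DATE, and the holidays vector are all assumptions made for the example, and only two of the weather variables are included.

```r
library(dplyr)

# Toy stand-in for the bike-sharing CSV (the real data would come from read.csv()).
trips <- data.frame(
  Duration      = c(300, 620, 450, 900),
  Start.Date    = c("2019-01-01 08:05:00", "2019-01-01 08:40:00",
                    "2019-01-02 09:10:00", "2019-01-05 14:30:00"),
  Start.Station = c("A", "B", "A", "C"),
  End.Station   = c("B", "C", "C", "A"),
  Bike.number   = c("W001", "W002", "W003", "W004"),
  Member.Type   = c("registered", "registered", "casual", "casual"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the temperature CSV; the m/d/Y input format is an assumption.
weather <- data.frame(
  DATE = c("1/1/2019", "1/2/2019", "1/5/2019"),
  PRCP = c(0.00, 0.12, 0.00),
  TAVG = c(48, 39, 44),
  stringsAsFactors = FALSE
)

# Hypothetical holiday list used to derive workdays.
holidays <- as.Date("2019-01-01")

# Convert DATE to a Date object, which prints as YYYY-MM-DD.
weather <- weather %>%
  mutate(DATE = as.Date(DATE, format = "%m/%d/%Y"))

hourly <- trips %>%
  # Drop the variables that are not needed.
  select(-Duration, -Start.Station, -End.Station, -Bike.number) %>%
  # Split Start.Date into Year, Month, Day, and Hour.
  mutate(
    start = as.POSIXct(Start.Date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    date  = as.Date(start),
    Year  = as.integer(format(start, "%Y")),
    Month = as.integer(format(start, "%m")),
    Day   = as.integer(format(start, "%d")),
    Hour  = as.integer(format(start, "%H"))
  ) %>%
  # Attach the daily weather to every trip.
  left_join(weather, by = c("date" = "DATE")) %>%
  # "%u" gives the ISO weekday number, where 6 and 7 are Saturday and Sunday.
  mutate(
    workdays = !(format(date, "%u") %in% c("6", "7")) & !(date %in% holidays)
  ) %>%
  # Aggregate to one row per hour and member type; num is the trip count.
  count(Year, Month, Day, Hour, PRCP, TAVG, workdays, Member.Type, name = "num")

print(hourly)
```

On the toy data, the two trips that start in the 8 o'clock hour on January 1 collapse into a single row with num equal to 2, and only January 2 is flagged as a workday. Grouping by Member.Type as well as by hour is an assumption based on Member.Type appearing among the derived variables.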

The derived data set is condensed to 16,889 rows with 13 variables:

Year - Year in which the trip occurred (2019)
Month - Month in which the trip occurred
Day - Day of the month on which the trip occurred
Hour - Hour of the day in which the trip occurred
PRCP - The amount of precipitation
SNOW - The amount of snow
SNWD - The depth of snow
TAVG - The average temperature in Fahrenheit
TMAX - The maximum temperature in Fahrenheit
TMIN - The minimum temperature in Fahrenheit
workdays - Indicates whether the day is a workday
Member.Type - Indicates whether the user was a “registered” member or a “casual” rider
num - Number of bike trips in that hour