Chapter 3 Data transformation
In our data transformation script, we perform the following steps:
- Since all data are .csv files, we load them using the read.csv() function in R.
- We drop the unnecessary variables in the bike-sharing data: Duration, Start Station, End Station, and Bike number.
- We choose the variable Start Date to represent the time at which each trip occurred. It is split into four date and time variables: Year, Month, Day, and Hour.
- For the temperature data, we convert the variable DATE to the standard YYYY-MM-DD format.
- We merge the two data sets on the date variable using the left_join() function in the dplyr package.
- We add the variable workdays, which is true if the day is neither a holiday nor a weekend day and false otherwise.
- Since the data frame is too large when every row represents a single trip, we change the granularity to hourly by aggregating the total count of bike trips in each hour of each day.
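The date handling, merge, and workday flag described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the two small data frames are made-up stand-ins for the files loaded with read.csv(), the column name Start.Date assumes read.csv() has replaced the space with a dot, the raw DATE format (month/day/year) is an assumption, the helper column Date is hypothetical, and the holiday vector holds a single example date rather than a full calendar.

```r
library(dplyr)

# Toy stand-ins for the two inputs (made-up values); the real data come from read.csv().
trips <- data.frame(
  Duration    = c(230, 1549, 905),
  Start.Date  = c("2019-01-01 00:04:48", "2019-01-05 13:20:00", "2019-01-07 08:15:30"),
  Member.Type = c("registered", "casual", "registered"),
  stringsAsFactors = FALSE
)
weather <- data.frame(
  DATE = c("01/01/2019", "01/05/2019", "01/07/2019"),  # assumed raw format: m/d/Y
  TAVG = c(48, 41, 39),
  stringsAsFactors = FALSE
)

# Drop a variable that is not needed downstream.
trips <- select(trips, -Duration)

# Split the start timestamp into Year, Month, Day, and Hour.
start <- as.POSIXct(trips$Start.Date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
trips$Year  <- as.integer(format(start, "%Y"))
trips$Month <- as.integer(format(start, "%m"))
trips$Day   <- as.integer(format(start, "%d"))
trips$Hour  <- as.integer(format(start, "%H"))
trips$Date  <- as.Date(format(start, "%Y-%m-%d"))  # hypothetical join key

# Convert the weather date to a Date, which prints as YYYY-MM-DD.
weather$DATE <- as.Date(weather$DATE, format = "%m/%d/%Y")

# Keep every trip and attach that day's weather.
merged <- left_join(trips, weather, by = c("Date" = "DATE"))

# workdays is TRUE only for non-holiday weekdays.
holidays   <- as.Date("2019-01-01")                        # assumed example holiday
is_weekend <- as.integer(format(merged$Date, "%u")) >= 6   # %u: 6 = Saturday, 7 = Sunday
merged$workdays <- !is_weekend & !(merged$Date %in% holidays)
```

A left join is a natural fit here because every trip should be kept even if a day's weather record were missing; in that case the weather columns would simply be NA for that trip.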
The derived data set is condensed to 16,889 rows with 13 variables:
- Year - Year in which the trip occurred (2019)
- Month - Month in which the trip occurred
- Day - Day of the month on which the trip occurred
- Hour - Hour of the day in which the trip occurred
- PRCP - The amount of precipitation
- SNOW - The amount of snow
- SNWD - The depth of snow
- TAVG - The average temperature in Fahrenheit
- TMAX - The maximum temperature in Fahrenheit
- TMIN - The minimum temperature in Fahrenheit
- workdays - Indicating if it is a workday
- Member.Type - Indicates whether the user was a “registered” member or a “casual” rider
- num - Number of bike trips in that hour
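As a rough illustration of how the num column can arise, the sketch below collapses trip-level rows into one row per hour and member type. It uses base R's aggregate() on made-up toy rows; this is an assumption for illustration, and the original script may perform the aggregation differently (for example with dplyr).

```r
# Toy trip-level rows (made-up values), one row per trip.
trips <- data.frame(
  Year  = 2019L,
  Month = 1L,
  Day   = 7L,
  Hour  = c(8L, 8L, 8L, 9L),
  Member.Type = c("registered", "registered", "casual", "registered"),
  stringsAsFactors = FALSE
)

# Give each trip a count of 1, then sum the counts within each group.
trips$num <- 1L
hourly <- aggregate(num ~ Year + Month + Day + Hour + Member.Type,
                    data = trips, FUN = sum)

# Sort for readability: by hour, then by member type.
hourly <- hourly[order(hourly$Hour, hourly$Member.Type), ]
```

Because the weather variables and workdays are constant within a day, they could be added to the grouping formula without changing the number of rows in the result.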