Chapter 3 Data transformation
In our data transformation script, we perform the following steps:
- Since all data are .csv files, we load them using the read.csv() function in R.
- We drop the unnecessary variables in the bike-sharing data: Duration, Start Station, End Station, and Bike number.
- We choose the variable Start Date to represent the time at which each trip occurred. It is split into four date and time variables: Year, Month, Day, and Hour.
- For the temperature data, we convert the variable DATE to the standard YYYY-MM-DD format.
- We merge the two data sets on the date variable using the left_join() function in the dplyr package.
- We add the variable workdays, which is true if the day is neither a holiday nor a weekend and false otherwise.
- Because the data frame is too large when every row represents a single trip, we change the granularity to one hour by aggregating the total count of bike trips in each hour of each day.
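The steps above can be sketched roughly as follows. This is a minimal illustration on toy data, not the actual transformation script: the two small data frames stand in for the files loaded with read.csv(), and the column spellings (Start.Date, End.Station, Bike.number), the m/d/Y input format of DATE, the helper column date, and the holidays vector are all assumptions made for the example.

```r
library(dplyr)

# Toy stand-ins for the two .csv files (all values are invented).
trips <- data.frame(
  Duration      = c(300, 450, 600, 200),
  Start.Date    = c("2019-01-01 08:05:00", "2019-01-01 08:40:00",
                    "2019-01-01 09:10:00", "2019-01-02 08:15:00"),
  Start.Station = c("A", "B", "A", "C"),
  End.Station   = c("B", "A", "C", "A"),
  Bike.number   = c("W001", "W002", "W003", "W001"),
  Member.Type   = c("Member", "Member", "Casual", "Member"),
  stringsAsFactors = FALSE
)
weather <- data.frame(
  DATE = c("1/1/2019", "1/2/2019"),  # assumed raw format: m/d/Y
  PRCP = c(0, 0.1),
  TAVG = c(40, 35),
  stringsAsFactors = FALSE
)
holidays <- as.Date("2019-01-01")    # hypothetical holiday list

# Drop unneeded variables and split the start time into its parts.
trips <- trips %>%
  select(-Duration, -Start.Station, -End.Station, -Bike.number) %>%
  mutate(
    start_time = as.POSIXct(Start.Date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    date  = as.Date(start_time),
    Year  = as.integer(format(start_time, "%Y")),
    Month = as.integer(format(start_time, "%m")),
    Day   = as.integer(format(start_time, "%d")),
    Hour  = as.integer(format(start_time, "%H"))
  ) %>%
  select(-start_time, -Start.Date)

# Convert DATE to a Date, which prints as YYYY-MM-DD.
weather$DATE <- as.Date(weather$DATE, format = "%m/%d/%Y")

hourly <- trips %>%
  # Keep every trip and attach that day's weather.
  left_join(weather, by = c("date" = "DATE")) %>%
  # wday is 0 for Sunday and 6 for Saturday.
  mutate(
    workdays = !(as.POSIXlt(date)$wday %in% c(0, 6)) & !(date %in% holidays)
  ) %>%
  # One row per hour and rider type; num is the number of trips.
  count(Year, Month, Day, Hour, PRCP, TAVG, workdays, Member.Type, name = "num")

print(hourly)
```

On this toy input the four trips collapse to three hourly rows, and the sum of num still equals the original number of trips, which is a quick sanity check worth running after any aggregation of this kind.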
The derived data set is condensed to 16,889 rows with 13 variables:
- Year - Year in which the trip took place (2019)
- Month - Month in which the trip took place
- Day - Day of the month on which the trip took place
- Hour - Hour of the day in which the trip took place
- PRCP - The amount of precipitation
- SNOW - The amount of snow
- SNWD - The depth of snow
- TAVG - The average temperature in Fahrenheit
- TMAX - The maximum temperature in Fahrenheit
- TMIN - The minimum temperature in Fahrenheit
- workdays - Indicates whether the day is a workday (neither a holiday nor a weekend)
- Member.Type - Indicates whether the user was a “registered” member or a “casual” rider
- num - Number of bike trips in that hour