Chapter 2 Data Sources

2.1 Bike sharing data

Because we share an interest in transportation systems, we noticed that bike sharing systems, unlike public transportation such as buses and subways, explicitly record trip-level statistics such as the travel duration and the departure and arrival locations. We therefore searched online for bike rental data. We first found a Capital Bikeshare data set on Kaggle, a website that hosts many data sets for data scientists to work on. To ensure the reliability of our data, we decided to use the original source instead: the Capital Bikeshare database. A further advantage of the original source is that it let us analyze recent riding data from 2019, whereas the processed Kaggle data set covers 2012. We did not use the 2020 data because the pandemic disrupted people's travel habits, so an analysis based on that year would not reflect typical usage. The data set we worked on can be found at http://capitalbikeshare.com/system-data.

The data set has 3,398,417 observations and 7 variables:

  • Duration – Duration of trip
  • Start Date – Includes start date and time
  • End Date – Includes end date and time
  • Start Station – Includes starting station name and number
  • End Station – Includes ending station name and number
  • Bike Number – Includes ID number of bike used for the trip
  • Member Type – Indicates whether user was a “registered” member or a “casual” rider

2.2 Weather data

The weather data were collected from https://www.ncdc.noaa.gov/ and supply the weather conditions for each day of the year. Since the data files are not publicly accessible from the website, we submitted a download request to https://www.ncdc.noaa.gov/. The data file is located at /temp_data/temp.csv.

The data set has 365 observations and 7 variables:

  • DATE - The index variable indicating the date when data was collected.
  • PRCP - The amount of precipitation
  • SNOW - The amount of snow
  • SNWD - The depth of snow
  • TAVG - The average temperature in Fahrenheit
  • TMAX - The maximum temperature in Fahrenheit
  • TMIN - The minimum temperature in Fahrenheit
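
As a rough illustration of how the daily weather records can be attached to individual trips, the sketch below keys the weather rows by DATE and looks up each trip's start date. It is written in Python with only the standard library rather than the R used for the analysis. The sample rows, the "YYYY-MM-DD HH:MM:SS" timestamp format, and the function name attach_weather are assumptions made for illustration, not values or code from the actual project.

```python
import csv
import io

# Hypothetical sample rows; the values are made up for illustration only.
WEATHER_CSV = """DATE,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN
2019-01-01,0.00,0.0,0.0,50,60,41
2019-01-02,0.10,0.0,0.0,42,45,39
"""

TRIPS_CSV = """Duration,Start Date
230,2019-01-01 00:04:48
1549,2019-01-02 08:15:00
"""


def attach_weather(trips, weather_by_date):
    """Return new trip rows with the matching day's weather columns added."""
    merged = []
    for trip in trips:
        # Assumes timestamps look like "YYYY-MM-DD HH:MM:SS".
        day = trip["Start Date"][:10]
        # Trips on a day without a weather record are kept unchanged.
        weather = weather_by_date.get(day, {})
        merged.append({**trip, **weather})
    return merged


weather_rows = list(csv.DictReader(io.StringIO(WEATHER_CSV)))
weather_by_date = {row["DATE"]: row for row in weather_rows}
trips = list(csv.DictReader(io.StringIO(TRIPS_CSV)))
merged = attach_weather(trips, weather_by_date)
```

Indexing the 365 weather rows by date first means each trip needs only one dictionary lookup, which matters when there are millions of trips.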

2.3 Holiday schedule

The holiday schedule can be found at http://dchr.dc.gov/page/holiday-schedule. Since the website offers no downloadable data file and there are only a handful of holidays in 2019, we manually added them to the data frame in the data transformation stage.
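
Entering the holidays by hand amounts to keeping a small lookup table and flagging each day against it. The Python sketch below shows the idea under stated assumptions: it lists only three fixed-date holidays as an illustrative subset (the full list would come from the schedule page above), and the names HOLIDAYS_2019, is_holiday, and add_holiday_flag are hypothetical.

```python
from datetime import date

# Illustrative subset only: three fixed-date holidays in 2019.
# The complete list should be taken from the holiday schedule page.
HOLIDAYS_2019 = {
    date(2019, 1, 1): "New Year's Day",
    date(2019, 7, 4): "Independence Day",
    date(2019, 12, 25): "Christmas Day",
}


def is_holiday(day):
    """Return True if the given date is in the hand-entered holiday table."""
    return day in HOLIDAYS_2019


def add_holiday_flag(rows):
    """Return copies of the rows with a boolean HOLIDAY column added.

    Each row is assumed to carry an ISO-formatted DATE string (YYYY-MM-DD).
    """
    flagged = []
    for row in rows:
        day = date.fromisoformat(row["DATE"])
        flagged.append({**row, "HOLIDAY": is_holiday(day)})
    return flagged


daily = [{"DATE": "2019-07-03"}, {"DATE": "2019-07-04"}]
flagged = add_holiday_flag(daily)
```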

2.4 Issues

  1. A couple of variables, such as Bike Number in the bike-sharing data, are unnecessary for our research purpose.
  2. The variables Start Date and End Date each combine the date and the time, which we need to split into separate variables for analysis.
  3. The data set is so large that computation in R is slow during the analysis.
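
The three issues above can each be addressed with a short transformation, sketched here in Python with the standard library only. This is an illustration and not the project's actual cleaning code: the "YYYY-MM-DD HH:MM:SS" timestamp format, the new column names ("Start Day", "Start Time", and so on), the example row, and the helper names are all assumptions. Random subsampling is shown as one common way to speed up exploratory work on a large data set, not necessarily the approach taken in this project.

```python
import random
from datetime import datetime

# Assumed timestamp layout of the Start Date and End Date columns.
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def clean_trip(row):
    """Drop Bike Number and split each timestamp into a date and a time."""
    cleaned = {key: value for key, value in row.items() if key != "Bike Number"}
    for field in ("Start Date", "End Date"):
        stamp = datetime.strptime(cleaned.pop(field), STAMP_FORMAT)
        prefix = field.split()[0]  # "Start" or "End"
        cleaned[prefix + " Day"] = stamp.date().isoformat()
        cleaned[prefix + " Time"] = stamp.time().isoformat()
    return cleaned


def sample_rows(rows, fraction, seed=0):
    """Return a reproducible random subset holding about `fraction` of the rows."""
    size = round(len(rows) * fraction)
    return random.Random(seed).sample(rows, size)


# Hypothetical example row; the values are made up for illustration only.
example = {
    "Duration": "230",
    "Start Date": "2019-01-01 00:04:48",
    "End Date": "2019-01-01 00:08:39",
    "Bike Number": "W00000",
    "Member Type": "Member",
}
cleaned = clean_trip(example)
subset = sample_rows(list(range(1000)), 0.1)
```

Fixing the seed keeps the subset identical between runs, so results obtained on the sample can be reproduced before rerunning on the full data.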