Introduction to DataExplorer

Boxuan Cui

2020-01-07

This document introduces the package DataExplorer, and shows how it can help you with different tasks throughout your data exploration process.

There are 3 main goals for DataExplorer:

  1. Exploratory Data Analysis (EDA)
  2. Feature Engineering
  3. Data Reporting

The remainder of this guide is organized according to these goals. As the package evolves, more content will be added.

Data

We will be using the nycflights13 datasets for this document. If you have not installed the package, please do the following:

install.packages("nycflights13")
library(nycflights13)

There are 5 datasets in this package: airlines, airports, flights, planes and weather.

If you want to quickly visualize the structure of all of them, you may do the following:

library(DataExplorer)
data_list <- list(airlines, airports, flights, planes, weather)
plot_str(data_list)

You may also try plot_str(data_list, type = "r") for a radial network.


Now let’s merge all tables together for a more robust dataset for later sections.

merge_airlines <- merge(flights, airlines, by = "carrier", all.x = TRUE)
merge_planes <- merge(merge_airlines, planes, by = "tailnum", all.x = TRUE, suffixes = c("_flights", "_planes"))
merge_airports_origin <- merge(merge_planes, airports, by.x = "origin", by.y = "faa", all.x = TRUE, suffixes = c("_carrier", "_origin"))
final_data <- merge(merge_airports_origin, airports, by.x = "dest", by.y = "faa", all.x = TRUE, suffixes = c("_origin", "_dest"))
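To see what all.x = TRUE and suffixes do in the merges above, here is a minimal sketch on two made-up data frames. The toy tables and their values are assumptions for illustration and are not part of nycflights13. A left join keeps every row of the first table, fills unmatched columns with NA, and appends the suffixes to non-key columns that share a name in both tables.

```r
# Toy tables (hypothetical values, for illustration only)
toy_flights <- data.frame(
  tailnum = c("N1", "N2", "N3"),
  year = c(2013, 2013, 2013),
  stringsAsFactors = FALSE
)
toy_planes <- data.frame(
  tailnum = c("N1", "N2"),
  year = c(1999, 2004),
  stringsAsFactors = FALSE
)

# Left join: all.x = TRUE keeps every row of toy_flights,
# even "N3", which has no match in toy_planes
toy_merged <- merge(
  toy_flights, toy_planes,
  by = "tailnum", all.x = TRUE,
  suffixes = c("_flights", "_planes")
)

# Both tables have a "year" column, so each copy gets its suffix
names(toy_merged)

# The unmatched plane gets NA in the columns that come from toy_planes
toy_merged[toy_merged$tailnum == "N3", ]
```

This also explains why the merged dataset contains so many missing values: every flight whose tailnum, origin or dest has no match in the lookup table receives NA in all of the joined columns.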

Exploratory Data Analysis

Exploratory data analysis is the process of getting to know your data, so that you can generate and test hypotheses. Visualization techniques are usually applied.

To get introduced to your newly created dataset:

introduce(final_data)
rows 336,776
columns 42
discrete_columns 16
continuous_columns 26
all_missing_columns 0
total_missing_values 809,170
complete_rows 906
total_observations 14,144,592
memory_usage 97,254,656

To visualize the table above (with some light analysis):

plot_intro(final_data)

You should immediately notice some surprises:

  1. 0.3% complete rows: This means only 0.3% of all rows have no missing values at all!
  2. 5.7% missing observations: Although only 0.3% of rows are complete, just 5.7% of all observations are missing, which suggests that the missing values are concentrated in a few columns.
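The two percentages in the list above can be reproduced from the counts reported by introduce(final_data). The short sketch below only redoes that arithmetic; the variable names are chosen here for readability.

```r
# Counts copied from the introduce(final_data) output
rows <- 336776
columns <- 42
complete_rows <- 906
total_missing_values <- 809170

# Every cell of the table is one observation
total_observations <- rows * columns

# Share of rows without any missing value, and share of missing cells
pct_complete_rows <- 100 * complete_rows / rows
pct_missing_obs <- 100 * total_missing_values / total_observations

round(pct_complete_rows, 1)  # 0.3
round(pct_missing_obs, 1)    # 5.7
```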

Missing values are definitely creating problems. Let’s take a look at the missing profiles.

Missing values

Real-world data is messy, and you can use the plot_missing function to visualize the missing profile for each feature.

plot_missing(final_data)

From the chart, the speed variable is mostly missing and probably not informative. It looks like we have found the culprit for the 0.3% complete rows. Let’s drop it:

final_data <- drop_columns(final_data, "speed")

Note: You may store the missing data profile with profile_missing(final_data) for additional analysis.
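For intuition about what a missing data profile contains, the sketch below builds a similar per-column summary in base R on a made-up data frame. It is an approximation for illustration, not the package's implementation; the toy data and the column names feature, num_missing and pct_missing are assumptions made here.

```r
# Toy data frame with some missing values (hypothetical)
toy <- data.frame(
  speed = c(NA, NA, NA, 90),
  dep_delay = c(5, NA, 12, 0),
  carrier = c("AA", "UA", "DL", "AA"),
  stringsAsFactors = FALSE
)

# Count and share of missing values per column
num_missing <- colSums(is.na(toy))
missing_profile <- data.frame(
  feature = names(num_missing),
  num_missing = as.integer(num_missing),
  pct_missing = as.numeric(num_missing) / nrow(toy),
  stringsAsFactors = FALSE
)

# Sort so that the columns with the most missing values come first
missing_profile <- missing_profile[order(-missing_profile$pct_missing), ]
missing_profile
```

A table like this makes it easy to pick drop candidates programmatically, for example all features whose share of missing values exceeds a threshold you choose.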

Distributions

Bar Charts

To visualize frequency distributions for all discrete features:

plot_bar(final_data)
## 5 columns ignored with more than 50 categories.
## dest: 105 categories
## tailnum: 4044 categories
## time_hour: 6936 categories
## model: 128 categories
## name: 102 categories

Upon closer inspection of the manufacturer variable, it is not hard to identify the following duplications:

  • AIRBUS and AIRBUS INDUSTRIE
  • CANADAIR and CANADAIR LTD
  • MCDONNELL DOUGLAS, MCDONNELL DOUGLAS AIRCRAFT CO and MCDONNELL DOUGLAS CORPORATION

Let’s clean it up and look at the manufacturer distribution again:

final_data[which(final_data$manufacturer == "AIRBUS INDUSTRIE"),]$manufacturer <- "AIRBUS"
final_data[which(final_data$manufacturer == "CANADAIR LTD"),]$manufacturer <- "CANADAIR"
final_data[which(final_data$manufacturer %in% c("MCDONNELL DOUGLAS AIRCRAFT CO", "MCDONNELL DOUGLAS CORPORATION")),]$manufacturer <- "MCDONNELL DOUGLAS"

plot_bar(final_data$manufacturer)
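When the list of duplicated labels grows, one assignment per variant becomes tedious. As an alternative sketch, the same consolidation can be driven by a lookup vector. The vector name canonical and the toy input below are assumptions for illustration.

```r
# Toy vector of manufacturer labels (hypothetical sample)
manufacturer <- c("AIRBUS", "AIRBUS INDUSTRIE", "CANADAIR LTD",
                  "MCDONNELL DOUGLAS CORPORATION", "BOEING", NA)

# Named vector: names are the variants, values are the canonical labels
canonical <- c(
  "AIRBUS INDUSTRIE" = "AIRBUS",
  "CANADAIR LTD" = "CANADAIR",
  "MCDONNELL DOUGLAS AIRCRAFT CO" = "MCDONNELL DOUGLAS",
  "MCDONNELL DOUGLAS CORPORATION" = "MCDONNELL DOUGLAS"
)

# Replace only values found in the lookup; everything else,
# including NA, is left unchanged
is_variant <- manufacturer %in% names(canonical)
manufacturer[is_variant] <- unname(canonical[manufacturer[is_variant]])

table(manufacturer, useNA = "ifany")
```

Keeping the mapping in one place makes it easier to review and extend than a series of separate assignments.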

Features dst_origin and tzone_origin contain only one value each, so we should drop them:

final_data <- drop_columns(final_data, c("dst_origin", "tzone_origin"))
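Single-valued columns like these can also be found programmatically instead of by reading the bar charts. The following base-R sketch flags columns with at most one distinct non-missing value; the toy data frame and the helper names is_constant and constant_cols are assumptions for illustration.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  origin = c("JFK", "LGA", "EWR"),
  tzone_origin = c("America/New_York", "America/New_York", "America/New_York"),
  dst_origin = c("A", "A", "A"),
  stringsAsFactors = FALSE
)

# TRUE for columns with at most one distinct non-missing value
is_constant <- vapply(
  toy,
  function(x) length(unique(x[!is.na(x)])) <= 1,
  logical(1)
)
constant_cols <- names(toy)[is_constant]
constant_cols

# Keep only the columns that carry information
toy <- toy[, !is_constant, drop = FALSE]
names(toy)
```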

It is often useful to look at bivariate frequency distributions. For example, to look at discrete features by arr_delay:

plot_bar(final_data, with = "arr_delay")
## 5 columns ignored with more than 50 categories.
## dest: 105 categories
## tailnum: 4044 categories
## time_hour: 6936 categories
## model: 128 categories
## name: 102 categories

The resulting distribution looks quite different from the regular frequency distribution.
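The difference comes from what each bar measures. Assuming, as the plot_bar documentation describes, that the with argument names a continuous feature to be summed within each category, the sketch below contrasts a plain frequency count with such a sum in base R. The toy data is made up, and dropping missing values with na.rm = TRUE is a choice made for this sketch rather than a statement about the package's behavior.

```r
# Toy data (hypothetical values)
toy <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA", "DL"),
  arr_delay = c(10, -5, 30, NA, 20, -15),
  stringsAsFactors = FALSE
)

# Plain frequency: number of rows per category
freq <- table(toy$carrier)

# Sum of arr_delay per category, ignoring missing values
delay_sum <- tapply(toy$arr_delay, toy$carrier, sum, na.rm = TRUE)

freq
delay_sum
```

In this toy example the most frequent carrier also has the largest total delay, but the two rankings need not agree: a rarely seen category with long delays can outweigh a common one with short delays, and negative delays can cancel positive ones.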

Histograms

To visualize distributions for all continuous features:

plot_histogram(final_data)
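As a rough base-R illustration of the idea, the sketch below selects the numeric columns of a made-up data frame and computes a histogram for each one without drawing it. The toy data and the names numeric_cols and hists are assumptions, and this is not how plot_histogram is implemented.

```r
# Toy data frame mixing continuous and discrete columns (hypothetical)
toy <- data.frame(
  distance = c(200, 750, 1400, 2475, 300, 1000),
  air_time = c(40, 110, 190, 330, 55, 150),
  carrier = c("AA", "UA", "DL", "AA", "UA", "DL"),
  stringsAsFactors = FALSE
)

# Keep only the continuous (numeric) features
numeric_cols <- names(toy)[vapply(toy, is.numeric, logical(1))]

# plot = FALSE returns the bin breaks and counts without drawing
hists <- lapply(toy[numeric_cols], hist, plot = FALSE)

# Every value falls into exactly one bin
sum(hists$distance$counts) == nrow(toy)
```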