The Common Table Expressions or CTE’s for short are used within SQL databases to simplify complex joins and subqueries. You can think of them as named subqueries that can be referenced in other parts of the query, including other CTEs (i.e. recursive CTEs).
In Presto, CTEs take the form of WITH
clauses. There can be multiple named CTE within the single
WITH clause. They need to be defined before the main
SELECT query.
At the time of this writing (late 2022), DBI APIs don’t
have an official way of implementing CTEs yet. dbplyr just
begins to introduce CTEs into its APIs as an experimental feature. So
our implementation of CTE support in RPresto is very much
avant-garde and should be used with discretion.
We attach CTEs to the PrestoConnection so that they are
available across the queries executed via the connection.
You can check your RPresto version by running the
packageVersion() function. You need version 1.4.0
or later to use the CTE feature.
PrestoConnectionYou can define and attach CTEs while creating a
PrestoConnection. Here we assume that the user already have
a Presto server with a memory connector set up. If you don’t have such a
server set up, refer to the Presto
documentation for instructions if you want to follow along.
con <- DBI::dbConnect(
  drv = RPresto::Presto(),
  host = "http://localhost",
  port = 8080,
  user = Sys.getenv("USER"),
  catalog = "memory",
  schema = "default",
  # Define a testing CTE using dummy VALUES
  ctes = list(
    "dummy_values" =
      "SELECT * FROM (VALUES (1, 'a'), (2, 'b'), (3, 'c') ) AS t (id, name)"
  )
)Now dummy_values is not an existing permanent table
available in the PrestoConnection. It only exists as a
temporary feature for the connection.
We can read the content of the CTE.
dbReadTable(con, "dummy_values")
#> # A tibble: 3 × 2
#>      id name 
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b    
#> 3     3 cWe can also execute arbitrary SELECT queries on top of
the CTE.
dplyr backendAnother way of leveraging CTEs in your workflow is to incorporate
them into the dplyr workflow.
# We first copy mtcars to Presto and create a remote table on it
tbl.mtcars <- copy_to(con, mtcars, "test_mtcars", overwrite = TRUE)
tbl.mtcars %>% colnames()
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"We call a few dplyr verbs on the remote table to mimic a
typical analysis work flow.
tbl.mtcars.transform <- tbl.mtcars %>%
  mutate(hp2 = pow(hp, 2)) %>%
  group_by(cyl) %>%
  mutate(mean_mpg_by_cyl = mean(mpg, na.rm = TRUE))We can see the underlying SQL query generated so far.
tbl.mtcars.transform %>% show_query()
#> <SQL>
#> SELECT "q01".*, AVG("mpg") OVER (PARTITION BY "cyl") AS "mean_mpg_by_cyl"
#> FROM (
#>   SELECT "test_mtcars".*, pow("hp", 2.0) AS "hp2"
#>   FROM "test_mtcars"
#> ) "q01"For illustration, let’s say we filter the same transformed table
twice on the cyl field and UNION ALL them
together in the next step.
tbl.mtcars.union <- union(
  filter(tbl.mtcars.transform, cyl == 4L),
  filter(tbl.mtcars.transform, cyl == 8L),
  all = TRUE
)
tbl.mtcars.union %>% show_query()
#> <SQL>
#> SELECT "q01".*
#> FROM (
#>   SELECT "q01".*, AVG("mpg") OVER (PARTITION BY "cyl") AS "mean_mpg_by_cyl"
#>   FROM (
#>     SELECT "test_mtcars".*, pow("hp", 2.0) AS "hp2"
#>     FROM "test_mtcars"
#>   ) "q01"
#> ) "q01"
#> WHERE ("cyl" = 4)
#> 
#> UNION ALL
#> 
#> SELECT "q01".*
#> FROM (
#>   SELECT "q01".*, AVG("mpg") OVER (PARTITION BY "cyl") AS "mean_mpg_by_cyl"
#>   FROM (
#>     SELECT "test_mtcars".*, pow("hp", 2.0) AS "hp2"
#>     FROM "test_mtcars"
#>   ) "q01"
#> ) "q01"
#> WHERE ("cyl" = 8)The underlying SQL query of the transformed table
(i.e. tbl.mtcars.transform) has to be replicated twice in
this step and thus makes the resulting query long and repetitive. It
offers a prime opportunity to simplify using CTEs.
We can “save” the underlying SQL query of the transformed table into
a CTE and use that in the union step by calling the
compute() function with cte = TRUE.
tbl.mtcars.transform <- tbl.mtcars.transform %>%
  compute(name = "mtcars_transform", cte = TRUE)
tbl.mtcars.transform %>% show_query()
#> <SQL>
#> WITH "mtcars_transform" AS (
#> SELECT "q01".*, AVG("mpg") OVER (PARTITION BY "cyl") AS "mean_mpg_by_cyl"
#> FROM (
#>   SELECT "test_mtcars".*, pow("hp", 2.0) AS "hp2"
#>   FROM "test_mtcars"
#> ) "q01"
#> )
#> SELECT *
#> FROM "mtcars_transform"Here the content of tbl.mtcars.transform hasn’t changed
at all and we can use the remote table as it is just like before. The
only change underneath is that the underlying logic is now captured and
stored in a CTE. You can almost think of it as saving
tbl.mtcars.transform as a temporary table named
mtcars_transform and pointing the new remote table on that
temporary table. The difference is that no query has actually been
executed yet.
Now we’ve leveraged CTE, the query for the union step looks more clean and readable.
tbl.mtcars.union <- union(
  filter(tbl.mtcars.transform, cyl == 4L),
  filter(tbl.mtcars.transform, cyl == 8L),
  all = TRUE
)
tbl.mtcars.union %>% show_query()
#> <SQL>
#> WITH "mtcars_transform" AS (
#> SELECT "q01".*, AVG("mpg") OVER (PARTITION BY "cyl") AS "mean_mpg_by_cyl"
#> FROM (
#>   SELECT "test_mtcars".*, pow("hp", 2.0) AS "hp2"
#>   FROM "test_mtcars"
#> ) "q01"
#> )
#> SELECT "mtcars_transform".*
#> FROM "mtcars_transform"
#> WHERE ("cyl" = 4)
#> 
#> UNION ALL
#> 
#> SELECT "mtcars_transform".*
#> FROM "mtcars_transform"
#> WHERE ("cyl" = 8)We can even create nested CTEs that depend on other CTEs (Presto
calls it chained CTEs). Below we call compute() on
tbl.mtcars.union which already utilizies the
mtcars_transform CTE.
tbl.mtcars.union <- tbl.mtcars.union %>%
  compute(name = "mtcars_union", cte = TRUE)
tbl.mtcars.union %>% show_query()
#> <SQL>
#> WITH "mtcars_transform" AS (
#> SELECT "q01".*, AVG("mpg") OVER (PARTITION BY "cyl") AS "mean_mpg_by_cyl"
#> FROM (
#>   SELECT "test_mtcars".*, pow("hp", 2.0) AS "hp2"
#>   FROM "test_mtcars"
#> ) "q01"
#> ),
#> "mtcars_union" AS (
#> SELECT "mtcars_transform".*
#> FROM "mtcars_transform"
#> WHERE ("cyl" = 4)
#> 
#> UNION ALL
#> 
#> SELECT "mtcars_transform".*
#> FROM "mtcars_transform"
#> WHERE ("cyl" = 8)
#> )
#> SELECT *
#> FROM "mtcars_union"Now the underlying query of the previous
tbl.mtcars.union is saved into the
mtcars_union CTE which in turn depends on the
mtcars_transform CTE.