Personal Site: JavierOrracaDeatcu.com | LinkedIn: linkedin.com/in/orraca
Shiny is a code-based web app framework for Python and R
Easy-ish-ly build web apps with no formal web dev experience
Use reactive programming that allows for dynamic, real-time updates to the app based on user input
Extend your Shiny apps with HTML widgets, real-time data polling, JavaScript, CSS, and more
Learn more about Shiny for Python and Shiny for R
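To make the reactivity point concrete, here's a minimal sketch of a Shiny for R app (a generic illustration, not one of the talk's demos): the plot re-renders automatically whenever the slider changes.

```r
library(shiny)

# UI: a slider input and a plot output
ui <- fluidPage(
  sliderInput("n", "Number of points:", min = 10, max = 500, value = 100),
  plotOutput("scatter")
)

# Server: renderPlot() creates a reactive dependency on input$n,
# so the plot updates in real time as the user moves the slider
server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n), xlab = "x", ylab = "y")
  })
}

shinyApp(ui, server)
```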
Most corporate dashboards feel clunky…
They’re laggy and boxy (e.g., Power BI, Tableau, MicroStrategy)
Dashboards typically struggle with fast, real-time calculations, search, and data manipulation capabilities
To impress hiring teams, you can use Shiny as a resume accessory, demonstrating how easily you can develop and deploy a web app styled with the company’s brand aesthetics
Check out my blog post: Impressing Hiring Teams with a Shiny App Demo (2023-03-28)
Bloomreach’s website:
Javier’s Shiny demo:
Launch this live Shiny app demo: https://javierorraca.shinyapps.io/Bloomreach_Shiny_App
Posit’s Shiny Assistant blog post | Try the Shiny Assistant for yourself!
R for Data Science, 2nd Edition by Wickham, Çetinkaya-Rundel, & Grolemund | https://r4ds.hadley.nz
Learn more about the open-source RStudio IDE | https://posit.co/products/open-source/rstudio
RStudio User Guide | https://docs.posit.co/ide/user/ide/get-started
Learn about Posit Professional Products & Solutions | Deploy your content on Posit’s beta Connect Cloud!
Learn about Microsoft’s Visual Studio Code (“VS Code”) User Interface | https://code.visualstudio.com/docs/getstarted/userinterface
Image: © Ander Fernández Jauregui | Learn about Project Jupyter’s Jupyter Notebook
RStudio IDE’s “always on” panes welcomed R users seeking a data-analysis-first experience
For Python users, RStudio felt too R-centric, and other tools worked just fine, including VS Code, Jupyter Notebooks, and PyCharm
There are many programming languages that can be used for data analysis, but Python and R are the de facto standards for data science
So, why isn’t there
one tool
to rule them all!?
Me, using RStudio while my teammates use VS Code…
- A next-generation data science IDE built by Posit PBC
- An extensible, polyglot tool for writing code and exploring data
- A familiar environment for reproducible authoring and publishing
Visit the new Positron website to learn more | https://positron.posit.co/
Layout by Dr. Athanasia Mo Mowinckel | Positron IDE - A new IDE for data science
- Rtools: Windows users need Rtools, which contains the required compilers to build R packages from source code on Windows
- Linux users should install Python build dependencies (e.g., sqlite-devel or libsqlite3-dev) ahead of time so pyenv can build your Python version(s) of choice
- ipykernel: if you use venv or conda to manage your Python projects, you can install ipykernel manually as follows: python3 -m pip install ipykernel
Getting Started with Positron | Machine Prerequisites
Getting Started with Positron | Interpreter Selection
Learn more on Emil Hvitfeldt’s blog post Positron: My Key Bindings
| Shortcut | Description |
|---|---|
| Cmd/Ctrl+Enter | Run the selected code in the editor; if no code is selected, run the current statement |
| Cmd/Ctrl+Shift+0 | Restart the interpreter currently open in the Console |
| Cmd/Ctrl+Shift+Enter | Run the file open in the editor (using, e.g., source() or %run) |
| F1 | Show contextual help for the topic under the cursor |
| Cmd/Ctrl+K, Cmd/Ctrl+R | Show contextual help for the topic under the cursor (alternate binding) |
| Cmd/Ctrl+K, F | Focus the Console |
| Ctrl+L | Clear the Console |
Positron docs | Keyboard Shortcuts
| Shortcut | Description |
|---|---|
| Cmd/Ctrl+Shift+M | Insert the pipe operator (\|> or %>%) |
| Alt+- | Insert the assignment operator (<-) |
| Cmd/Ctrl+Shift+L | Load the current R package, if any |
| Cmd/Ctrl+Shift+B | Build and install the current R package, if any |
| Cmd/Ctrl+Shift+T | Test the current R package, if any |
| Cmd/Ctrl+Shift+E | Check the current R package, if any |
| Cmd/Ctrl+Shift+D | Document the current R package, if any |
Positron docs | Keyboard Shortcuts
If you’re an experienced RStudio user, you can easily set the RStudio keybindings in the Positron settings:
Positron docs | Keyboard Shortcuts
- View and explore data frames from Python (pandas, polars) or R (data.frame, tibble, data.table, polars)
Positron docs | Data Explorer
- Explore database connections established with ODBC drivers or packages (a sketch follows below)
- For Python users, the sqlite3 and SQLAlchemy packages are supported
- For R users, the odbc, sparklyr, bigrquery, and other packages are supported
Positron docs | Connections Pane
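As a rough illustration (the DSN name below is hypothetical), opening a DBI connection with a supported package surfaces it in the Connections Pane:

```r
library(DBI)

# Hypothetical DSN: connecting via the odbc package makes the
# database's catalogs and tables browsable in Positron's Connections Pane
con <- dbConnect(odbc::odbc(), dsn = "my_warehouse")
```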
- Instead of running apps from a Terminal, Positron lets you run supported apps by clicking the Play button in Editor Actions
- Supported apps include the following: Shiny, Dash, FastAPI, Flask, Gradio, and Streamlit
- You can also start apps in Debug mode
Positron docs | Run Interactive Apps
Visit Posit’s new Positron website and their YouTube channel to learn more
Learn more about Ark on the Ark GitHub repo
So, lesson learned…
Positron = Amazing.
But next…
let’s talk about
ULTRA
FAST
ETL 💨
Learn more about Apache Arrow, DuckDB, Polars, and Apache Parquet
- For .parquet files, the Arrow packages for C++, Python, and R support reading entire directories of files and treating them as a single dataset
- R users can work with pyarrow-based projects using R’s reticulate package
- PySpark and R’s sparklyr take advantage of Arrow for significant performance gains when transferring data
Learn more about Apache Arrow, DuckDB, Polars, and Apache Parquet
R
```r
# Arrow env vars for AWS S3 & GCP support
Sys.setenv(LIBARROW_MINIMAL = "false")
Sys.setenv(ARROW_S3 = "ON")
Sys.setenv(ARROW_GCS = "ON")

# Enable Polars install with pre-built Rust library binary
Sys.setenv(NOT_CRAN = "true")

# Install packages
install.packages(c("arrow", "duckdb"))
install.packages("polars", repos = "https://community.r-multiverse.org")
install.packages("polarssql", repos = "https://rpolars.r-universe.dev")
```
The .parquet dataset was partitioned by Year and Month (120 files)
Arrow | Learn more at https://arrow.apache.org/
Python
```python
import pyarrow.dataset as ds
import os

# Set path for download
# NYC Taxi Data download (40 GB)
data_path = os.path.join("data", "nyc-taxi")

# Open connection to the remote dataset
nyc_dataset = ds.dataset(
    "s3://voltrondata-labs-datasets/nyc-taxi",
    format="parquet"
)

# Filter for years 2012 - 2021
filtered_table = nyc_dataset.to_table(
    filter=ds.field("year").isin(list(range(2012, 2022)))
)

# Write out the filtered data, partitioned by year and month
ds.write_dataset(
    filtered_table,
    base_dir=data_path,
    format="parquet",
    partitioning=["year", "month"]
)
```
R
```r
library(here)
library(arrow)
library(dplyr)

# Set path for download
# NYC Taxi Data download (40 GB)
data_path <- here::here("data/nyc-taxi")

# Open connection to the remote dataset,
# filter for years 2012 - 2021, and
# write out the filtered data,
# partitioned by year and month
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
  filter(year %in% 2012:2021) |>
  write_dataset(data_path, partitioning = c("year", "month"))
```
Arrow | Learn more at https://arrow.apache.org/
Using dplyr, we can perform ETL on an arrow table and collect() the results back to a data frame:

```r
library(arrow)
library(dplyr)
library(bench)
library(ggplot2)

# Open Arrow connection to dataset (40 GB)
nyc_taxi <- open_dataset(here::here("data/nyc-taxi"))

# Benchmark dplyr pipeline
bnch <- bench::mark(
  min_iterations = 100,
  arrow = nyc_taxi |>
    filter(payment_type %in% c("Credit card", "Cash")) |>
    group_by(payment_type) |>
    summarise(mean_fare = mean(fare_amount, na.rm = TRUE),
              mean_tip = mean(tip_amount, na.rm = TRUE)) |>
    ungroup() |>
    dplyr::collect()
)
autoplot(bnch)
```
Arrow | Arrow + dplyr compatibility
Not every tidyverse function is supported by Arrow’s dplyr interface, for example: stringr::str_replace_na()
Arrow | Arrow + dplyr compatibility
Me, struggling to use my favorite tidyverse functions with Arrow… but wait! a solution exists:
Register stringr::str_replace_na() with the Arrow kernel
Arrow | Learn more about registering Arrow User Defined Functions (“UDFs”)
- Run arrow::schema() on your Arrow table to review the field name and data type pairs
- Since I’m cleaning up the vendor_name field, I know I’ll be working with an Arrow string() data type
- Use arrow::register_scalar_function() to register a UDF named replace_arrow_nas, setting auto_convert = TRUE
- The replace_arrow_nas() Arrow UDF works…
Arrow | Learn more about registering Arrow User Defined Functions (“UDFs”)
Eevee can’t believe it… “Wooooooooooow, Javi!”
- DuckDB Labs created an in-process database management system, like a SQLite database engine, but optimized for fast analytical queries and larger-than-memory analysis
- The duckdb package for Python offers a state-of-the-art optimizer that pushes down filters and projections directly into Arrow scans
- As a result, only relevant columns and partitions are read, which significantly accelerates query execution
- DuckDB comes with core and community extensions that expand the framework to, e.g., scan remote Databricks Unity Catalog tables without needing to spin up a Spark cluster (see the extension sketch after this list)
DuckDB Labs | Learn about DuckDB and DuckDB Extensions
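A minimal sketch of the extension mechanics, shown here with DuckDB’s core httpfs extension (the Unity Catalog integration mentioned above lives in a separate community extension):

```r
library(duckdb)

con <- dbConnect(duckdb())

# Install and load the core httpfs extension, which enables
# reading remote files over HTTP(S) and from S3
dbExecute(con, "INSTALL httpfs;")
dbExecute(con, "LOAD httpfs;")
```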
Python
```python
import duckdb

# Connect to an in-memory DuckDB instance and scan
# the Parquet data set to make a temp View
con = duckdb.connect()
con.execute("""
    CREATE VIEW nyc_taxi AS
    SELECT * FROM read_parquet('data/nyc-taxi/**/*.parquet', hive_partitioning=true)
""")

# Run your SQL query
df = con.execute("""
    SELECT
        payment_type,
        AVG(fare_amount) AS mean_fare,
        AVG(tip_amount) AS mean_tip
    FROM nyc_taxi
    WHERE payment_type IN ('Credit card', 'Cash')
    GROUP BY payment_type
""").df()
print(df)
```
R
```r
library(duckdb)

# Connect to an in-memory DuckDB instance and scan
# the Parquet data set to make a temp View
con <- dbConnect(duckdb())
dbExecute(con, "
  CREATE VIEW nyc_taxi AS
  SELECT * FROM read_parquet('data/nyc-taxi/**/*.parquet', hive_partitioning = true)
")

# Run your SQL query
df <- dbGetQuery(con, "
  SELECT
    payment_type,
    AVG(fare_amount) AS mean_fare,
    AVG(tip_amount) AS mean_tip
  FROM nyc_taxi
  WHERE payment_type IN ('Credit card', 'Cash')
  GROUP BY payment_type
")
print(df)
```
DuckDB Labs | Learn how DuckDB quacks Arrow
```python
import duckdb
import pyarrow as pa
import pyarrow.dataset as ds

# DuckDB via Python
# Open dataset using year,month folder partition
nyc = ds.dataset("nyc-taxi/", partitioning=["year", "month"])

# Get database connection
con = duckdb.connect()

# Run query that selects part of the data
query = con.execute(
    "SELECT total_amount, passenger_count, year "
    "FROM nyc WHERE total_amount > 100 AND year > 2014"
)

# Create a Record Batch Reader from the query result;
# fetch_record_batch() also accepts a parameter for the desired chunk size
record_batch_reader = query.fetch_record_batch()

# Stream over all batch chunks
for chunk in record_batch_reader:
    pass  # process each chunk as it streams
```

For comparison, the same filter and projection in pandas requires materializing the dataset up front:

```python
# We must exclude one of the columns of the NYC dataset
# due to an unimplemented cast in Arrow
working_columns = [
    "vendor_id", "pickup_at", "dropoff_at", "passenger_count",
    "trip_distance", "pickup_longitude", "pickup_latitude",
    "store_and_fwd_flag", "dropoff_longitude", "dropoff_latitude",
    "payment_type", "fare_amount", "extra", "mta_tax", "tip_amount",
    "tolls_amount", "total_amount", "year", "month",
]

# Open dataset using year,month folder partition
nyc_dataset = ds.dataset("nyc-taxi/", partitioning=["year", "month"])

# Generate a scanner to skip the problematic column
dataset_scanner = nyc_dataset.scanner(columns=working_columns)

# Materialize dataset to an Arrow Table
nyc_table = dataset_scanner.to_table()

# Generate DataFrame from Arrow Table
nyc_df = nyc_table.to_pandas()

# Apply filter
filtered_df = nyc_df[(nyc_df.total_amount > 100) & (nyc_df.year > 2014)]

# Apply projection
res = filtered_df[["total_amount", "passenger_count", "year"]]

# Transform result back to an Arrow Table
new_table = pa.Table.from_pandas(res)
```
DuckDB Labs | Learn how DuckDB quacks Arrow
duckdb vs pandas:
- pandas runtime was 146.91 seconds
- duckdb runtime was 0.05 seconds
DuckDB Labs | Learn how DuckDB quacks Arrow
- duckplyr, from DuckDB Labs, offers 1:1 compatibility with dplyr functions, but there are some caveats:
- Use dplyr’s summarize() function with the .by argument, as group_by() is not supported (a sketch follows below)
duckplyr | Learn more at https://duckdblabs.com/
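As a rough sketch under those caveats (assuming duckplyr’s as_duckplyr_df() wrapper; nyc_taxi_df is a hypothetical in-memory data frame):

```r
library(duckplyr)

# Hypothetical data frame; as_duckplyr_df() routes supported
# dplyr verbs through DuckDB's engine
taxi <- as_duckplyr_df(nyc_taxi_df)

# Aggregate with summarize(.by = ...) instead of group_by()
taxi |>
  summarize(
    mean_fare = mean(fare_amount),
    mean_tip = mean(tip_amount),
    .by = payment_type
  )
```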
polars | Learn more at https://pola.rs/
- Stick with dplyr syntax across dplyr, arrow, duckdb, duckplyr, and polars workflows (see the polars sketch below)
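For instance, here’s a rough r-polars sketch of the same aggregation as the earlier queries (assuming a recent r-polars release where the lazy API and group_by() are available):

```r
library(polars)

# Lazily scan the partitioned Parquet dataset, filter, aggregate,
# then collect the result into memory
result <- pl$scan_parquet("data/nyc-taxi/**/*.parquet")$
  filter(pl$col("payment_type")$is_in(c("Credit card", "Cash")))$
  group_by("payment_type")$
  agg(
    pl$col("fare_amount")$mean()$alias("mean_fare"),
    pl$col("tip_amount")$mean()$alias("mean_tip")
  )$
  collect()
```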
DuckDB Labs | Learn more at https://duckdblabs.com/
Just to clarify…
It’s Quarto.
Q-U-A-R-T-O.
Not #4 in Spanish (or “cuatro”).
- Compatible with Jupyter .ipynb files (and, in many ways, the successor to R Markdown .Rmd), Quarto lets you develop static and interactive, reproducible, production-quality content including articles, presentations, dashboards, websites, blogs, and books in HTML, PDF, MS Word, ePub, and more
- Instead of maintaining custom .css or .scss, the new _brand.yml file can be used with your Quarto and Shiny projects to provide a unifying and portable branding framework (a sketch follows below)
Learn more about Quarto and brand.yml | Posit blog post: Unified branding across Posit tools with brand.yml
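A minimal hypothetical _brand.yml sketch (keys follow the brand.yml spec; the colors and fonts are placeholders):

```yaml
# _brand.yml — placeholder values for illustration
meta:
  name: Example Corp
color:
  primary: "#1482FA"
  foreground: "#151515"
  background: "#FFFFFF"
typography:
  fonts:
    - family: Open Sans
      source: google
  base: Open Sans
  headings: Open Sans
```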
The following sections were copied almost entirely from Quarto’s Reveal.js demo documentation
The next few slides cover what you can do with Quarto and Reveal.js including:
Quarto | Learn more on Quarto’s Reveal.js demo presentation
```r
# Define a server for the Shiny app
function(input, output) {
  # Fill in the spot we created for a plot
  output$phonePlot <- renderPlot({
    # Render a barplot
  })
}
```
Quarto | Syntax Highlighting
```r
# Define a server for the Shiny app
function(input, output) {
  # Fill in the spot we created for a plot
  output$phonePlot <- renderPlot({
    # Render a barplot
    barplot(WorldPhones[, input$region] * 1000,
            main = input$region,
            ylab = "Number of Telephones",
            xlab = "Year")
  })
}
```
Quarto | Code Animations
```python
import numpy as np
import matplotlib.pyplot as plt

# Plot a spiral in polar coordinates
r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, r)
ax.set_rticks([0.5, 1, 1.5, 2])
ax.grid(True)
plt.show()
```
Quarto | Line Highlighting
To include a LaTeX equation in Quarto, you would use the double dollar sign delimiters ($$) for a display equation on a separate line, like this:
$$x = \frac{-b \pm \sqrt{(b^2 - 4ac)}}{2a}$$
Quarto | LaTeX Equations
Great for training purposes or when you’ve got a time limit
Quarto | Learn more about the countdown package
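A minimal sketch of embedding a timer with the countdown R package (e.g., the 10-second timer shown on the slide):

```r
library(countdown)

# Embed a 10-second on-slide timer
countdown(minutes = 0, seconds = 10)
```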
Thank you! 🤍
Connect with me!