| Title: | Common Department for Education Analysis Tasks |
|---|---|
| Description: | Preferred methods for common analytical tasks that are undertaken across the Department, including number formatting, project templates and curated reference data. |
| Authors: | Cam Race [aut, cre], Department for Education, England [cph], Laura Selby [aut], Adam Robinson [aut], Jen Machin [ctb], Jake Tufts [ctb], Rich Bielby [ctb] (ORCID: <https://orcid.org/0000-0001-9070-9969>), Menna Zayed [ctb], Lauren Snaathorst [ctb], Lara Garbett [ctb], Paula Rocha [ctb] |
| Maintainer: | Cam Race <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 1.2.0.9000 |
| Built: | 2026-05-20 10:01:46 UTC |
| Source: | https://github.com/dfe-analytical-services/dfer |
checks for air installation status and installs it if required, updating the global settings if selected
air_install(update_rstudio_settings = FALSE, verbose = TRUE)air_install(update_rstudio_settings = FALSE, verbose = TRUE)
update_rstudio_settings |
auto update RStudio settings |
verbose |
Run in verbose mode |
## Not run: air_install() ## End(Not run)## Not run: air_install() ## End(Not run)
styles the whole project or single file using air
air_style(target = ".", verbose = FALSE)air_style(target = ".", verbose = FALSE)
target |
single file target for formatting |
verbose |
Run in verbose mode |
## Not run: air_style() ## End(Not run)## Not run: air_style() ## End(Not run)
Checks if the required environment variables for connecting to
Databricks are set, and if the odbc package version is sufficient.
check_databricks_odbc()check_databricks_odbc()
Prints instructions for fixing common problems to the console.
TRUE if the connection is set up correctly, FALSE otherwise.
Other databricks:
write_df_to_delta()
check_databricks_odbc()check_databricks_odbc()
Adds separating commas to big numbers. If a value is not numeric it will return the value unchanged and as a string.
comma_sep(number, nsmall = 0L)comma_sep(number, nsmall = 0L)
number |
number to be comma separated |
nsmall |
minimum number of digits to the right of the decimal point |
string
comma_sep(100) comma_sep(1000) comma_sep(3567000)comma_sep(100) comma_sep(1000) comma_sep(3567000)
A lookup of ONS geography country names and codes, as well as some custom DfE names and codes. This is used as the definitive list for the screening of open data before it is published by the DfE.
countriescountries
countriesA data frame with 10 rows and 2 columns:
Country name
Country code
curated by [email protected], ONS codes sourced from https://geoportal.statistics.gov.uk/search?q=countries%20names%20and%20codes
Creates a pre-populated project for DfE R
create_project( path, init_renv = TRUE, include_structure_for_pkg = FALSE, create_publication_proj = FALSE, include_github_gitignore, ... )create_project( path, init_renv = TRUE, include_structure_for_pkg = FALSE, create_publication_proj = FALSE, include_github_gitignore, ... )
path |
Path of the new project |
init_renv |
Boolean; initiate renv in the project. Default is set to true. |
include_structure_for_pkg |
Boolean; Additional folder structure for package development. Default is set to false. |
create_publication_proj |
Boolean; Should the folder structure be for a publication project. Default is set to false. |
include_github_gitignore |
Boolean; Should a strict .gitignore file for GitHub be created. |
... |
Additional parameters, currently not used |
This function creates a new project with a custom folder structure.
It sets up the R/ folder and template function scripts,
initializes {testthat} and adds tests for the function scripts,
builds the core project structure, creates a .gitignore file,
creates a readme, and optionally initializes {renv}.
No return values, the project and its contents are created
## Not run: # Call the function to create a new project dfeR::create_project( path = "C:/path/to/your/new/project", init_renv = TRUE, include_structure_for_pkg = FALSE, create_publication_proj = FALSE, include_github_gitignore = TRUE ) ## End(Not run)## Not run: # Call the function to create a new project dfeR::create_project( path = "C:/path/to/your/new/project", init_renv = TRUE, include_structure_for_pkg = FALSE, create_publication_proj = FALSE, include_github_gitignore = TRUE ) ## End(Not run)
Fetch a data frame of all Westminster Parliamentary Constituencies for a given year and country based on the dfeR::geo_hierarchy file.
fetch_pcons(year = "All", countries = "All")fetch_pcons(year = "All", countries = "All")
year |
year to filter the locations to, default is "All", options of 2017, 2019, 2020, 2021, 2022", 2023, 2024, 2025 |
countries |
vector of desired countries to filter the locations to, default is "All", or can be a vector with options of "England", "Scotland", "Wales" or "Northern Ireland" |
data frame of unique location names and codes
# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()
Fetch a data frame of all countries based on the dfeR::countries file.
fetch_countries()fetch_countries()
data frame of unique location names and codes
Other fetch_locations:
fetch_lads(),
fetch_las(),
fetch_mayoral(),
fetch_regions(),
fetch_wards()
# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()
Fetch a data frame of all local authority districts for a given year and country based on the dfeR::geo_hierarchy file.
fetch_lads(year = "All", countries = "All")fetch_lads(year = "All", countries = "All")
year |
year to filter the locations to, default is "All", options of 2017, 2019, 2020, 2021, 2022", 2023, 2024, 2025 |
countries |
vector of desired countries to filter the locations to, default is "All", or can be a vector with options of "England", "Scotland", "Wales" or "Northern Ireland" |
data frame of unique location names and codes
Other fetch_locations:
fetch_countries(),
fetch_las(),
fetch_mayoral(),
fetch_regions(),
fetch_wards()
# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()
Fetch a data frame of all local authorities for a given year and country based on the dfeR::geo_hierarchy file.
fetch_las(year = "All", countries = "All")fetch_las(year = "All", countries = "All")
year |
year to filter the locations to, default is "All", options of 2017, 2019, 2020, 2021, 2022", 2023, 2024, 2025 |
countries |
vector of desired countries to filter the locations to, default is "All", or can be a vector with options of "England", "Scotland", "Wales" or "Northern Ireland" |
data frame of unique location names and codes
Other fetch_locations:
fetch_countries(),
fetch_lads(),
fetch_mayoral(),
fetch_regions(),
fetch_wards()
# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()
Fetch a data frame of all mayoral combined authorities for a given year and country based on the dfeR::geo_hierarchy file.
fetch_mayoral(year = "All")fetch_mayoral(year = "All")
year |
year to filter the locations to, default is "All", options of 2017, 2019, 2020, 2021, 2022", 2023, 2024, 2025 |
Note that mayoral combined authorities only exist for England.
Mayoral combined authorities are also known as English Devolved Areas, as we add in the Greater London Authority to the combined authority lookup published by ONS.
data frame of unique location names and codes
Other fetch_locations:
fetch_countries(),
fetch_lads(),
fetch_las(),
fetch_regions(),
fetch_wards()
# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()
Fetch a data frame of all regions based on the dfeR::regions file.
fetch_regions()fetch_regions()
data frame of unique location names and codes
Other fetch_locations:
fetch_countries(),
fetch_lads(),
fetch_las(),
fetch_mayoral(),
fetch_wards()
# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()
Fetch a data frame of all wards for a given year and country based on the dfeR::geo_hierarchy file.
fetch_wards(year = "All", countries = "All")fetch_wards(year = "All", countries = "All")
year |
year to filter the locations to, default is "All", options of 2017, 2019, 2020, 2021, 2022", 2023, 2024, 2025 |
countries |
vector of desired countries to filter the locations to, default is "All", or can be a vector with options of "England", "Scotland", "Wales" or "Northern Ireland" |
data frame of unique location names and codes
Other fetch_locations:
fetch_countries(),
fetch_lads(),
fetch_las(),
fetch_mayoral(),
fetch_regions()
# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()# Using head() to show only top 5 rows for examples head(fetch_wards()) head(fetch_pcons()) head(fetch_pcons(2023)) head(fetch_pcons(countries = "Scotland")) head(fetch_pcons(year = 2023, countries = c("England", "Wales"))) head(fetch_mayoral()) fetch_lads(2024, "Wales") fetch_las(2022, "Northern Ireland") # The following have no specific years available and return all values fetch_regions() fetch_countries()
This function formats academic year variables for reporting purposes. It will convert an academic year input from 201516 format to 2015/16 format.
format_ay(year)format_ay(year)
year |
Academic year |
It accepts both numerical and character arguments.
Character vector of formatted academic year
Other format:
format_ay_reverse(),
format_fy(),
format_fy_reverse()
format_ay(201617) format_ay("201617")format_ay(201617) format_ay("201617")
This function converts academic year variables back into 201617 format.
format_ay_reverse(year)format_ay_reverse(year)
year |
Academic year |
It accepts character arguments.
Unformatted 6 digit year as string
Other format:
format_ay(),
format_fy(),
format_fy_reverse()
format_ay_reverse("2016/17")format_ay_reverse("2016/17")
This function formats financial year variables for reporting purposes. It will convert an year input from 201516 format to 2015-16 format.
format_fy(year)format_fy(year)
year |
Financial year |
It accepts both numerical and character arguments.
Character vector of formatted financial year
Other format:
format_ay(),
format_ay_reverse(),
format_fy_reverse()
format_fy(201617) format_fy("201617")format_fy(201617) format_fy("201617")
This function converts financial year variables back into 201617 format.
format_fy_reverse(year)format_fy_reverse(year)
year |
Financial year |
It accepts character arguments.
Unformatted 6 digit year as string
Other format:
format_ay(),
format_ay_reverse(),
format_fy()
format_fy_reverse("2016-17")format_fy_reverse("2016-17")
A lookup showing the hierarchy of ward to Westminster parliamentary constituency to local authority district to local authority to combined mayoral authority to region to country for years 2017, 2019, 2020, 2021, 2022, 2023, 2024 and 2025.
geo_hierarchygeo_hierarchy
geo_hierarchyA data frame with 26,057 rows and 17 columns:
First year in the lookups that we see this location
Last year in the lookups that we see this location
Ward name
Parliamentary constituency name
Local authority district name
Local authority name
Mayoral authority name
Region name
Country name
9 digit ward code
9 digit westminster constituency code
9 digit local authority district code
old 3 digit local authority code
9 digit local authority code
9 digit combined authority code
9 digit region code
9 digit country code
Note that combined mayoral authorities only exist in England, and we use
english_devolved_area_name and english_devolved_area_code to refer
to mayoral authorities in line with the standards set in DfE official
statistics. Greater London Authority is not included in the ONS combined
authority lookup, we have added that in for all local authority districts
with codes that start with E090....
Changes we've made to the original lookup:
The original lookup from ONS uses the Upper Tier Local Authority, we then update this so that where there is a metropolitan local authority we use the local authority district as the local authority to match how DfE publish data for local authorities.
We have noticed that in the 2017 version, the Glasgow East constituency had a code of S1400030 instead of the usual S14000030, we've assumed this was an error and have change this in our data so that Glasgow East is S14000030 in 2017.
We have joined on regions using the Ward to LAD to County to Region file.
We have joined on countries based on the E / N / S / W at the start of codes.
Scotland had no published regions in 2017, so given the rest of the years have Scotland as the region, we've forced that in for 2017 too to complete the data set.
We've added the Greater London Authority as the overarching mayoral
authority (English devolved area) for all Local authority districts with
codes that start E090....
https://geoportal.statistics.gov.uk/search?tags=lup_wd_pcon_lad_utla https://geoportal.statistics.gov.uk/search?q=lup_wd_lad_cty_rgn_gor_ctry https://geoportal.statistics.gov.uk/search?tags=LUP_LAD_CAUTH https://get-information-schools.service.gov.uk/Guidance/LaNameCodes and https://tinyurl.com/EESScreenerLAs
Potential names for geography and time columns in line with the ones used for the explore education statistics data screener.
geog_time_identifiersgeog_time_identifiers
geog_time_identifiersA character vector with 38 potential column names in snake case format.
curated by [email protected]. Guidance on time and geography data.
This function cleans a SQL script, ready for using within R in the DfE.
get_clean_sql(filepath, additional_settings = FALSE)get_clean_sql(filepath, additional_settings = FALSE)
filepath |
path to a SQL script |
additional_settings |
TRUE or FALSE boolean for the addition of settings at the start of the SQL script |
Cleaned string containing SQL query
# This assumes you have already set up a database connection # and that the filepath for the function exists # For more details see the vignette on connecting to SQL # Pull a cleaned version of the SQL file into R if (file.exists("your_script.sql")) { sql_query <- get_clean_sql("your_script.sql") }# This assumes you have already set up a database connection # and that the filepath for the function exists # For more details see the vignette on connecting to SQL # Pull a cleaned version of the SQL file into R if (file.exists("your_script.sql")) { sql_query <- get_clean_sql("your_script.sql") }
Helper function that takes a data set id and parameters to query and parse data from the ONS Open Geography API. Technically uses a POST request rather than a GET request.
get_ons_api_data( data_id, query_params = list(where = "1=1", outFields = "*", outSR = "4326", f = "json"), batch_size = 200, verbose = TRUE )get_ons_api_data( data_id, query_params = list(where = "1=1", outFields = "*", outSR = "4326", f = "json"), batch_size = 200, verbose = TRUE )
data_id |
the id of the data set to query, can be found from the Open Geography Portal |
query_params |
query parameters to pass into the API, see the ESRI documentation for more information on query parameters - ESRI Query (Feature Service/Layer) |
batch_size |
the number of rows per query. This is 250 by default, if you hit errors then try lowering this. The API has a limit of 1000 to 2000 rows per query, and in truth, the actual limit for our method is lower as every ObjectId queried is pasted into the query URL so for every row included in the batch, and especial if those Id's go into the 1,000s or 10,000s they will increase the size of the URL and risk hitting the limit. |
verbose |
TRUE or FALSE boolean. TRUE by default. FALSE will turn off the messages to the console that update on what the function is doing |
It does a pre-query to understand the ObjectIds for the query you want, and then does a query to retrieve those Ids directly in batches before then stacking the whole thing back together to work around the row limits for a single query.
On the Open Geography Portal, find the data set you're interested in and then use the query explorer to find the information for the query.
This function has been mostly developed for ease of use for dfeR maintainers if you're interested in getting data from the Open Geography Portal more widely you should also look at the boundr package.
parsed data.frame of geographic names and codes
# Fetch everything from a data set dfeR::get_ons_api_data(data_id = "LAD23_RGN23_EN_LU") # Specify the columns you want dfeR::get_ons_api_data( "RGN_DEC_2023_EN_NC", query_params = list( where = "1=1", outFields = "RGN23CD,RGN23NM", outSR = 4326, f = "json" ) )# Fetch everything from a data set dfeR::get_ons_api_data(data_id = "LAD23_RGN23_EN_LU") # Specify the columns you want dfeR::get_ons_api_data( "RGN_DEC_2023_EN_NC", query_params = list( where = "1=1", outFields = "RGN23CD,RGN23NM", outSR = 4326, f = "json" ) )
A lookup of ONS geography shorthands and their respective column names in line with DfE open data standards.
ons_geog_shorthandsons_geog_shorthands
ons_geog_shorthandsA data frame with 9 rows and 3 columns:
ONS shorthands used in their lookup files
DfE names for geography name columns
DfE names for geography code columns
GOR (Government Office Region) was the predecessor to RGN.
curated by [email protected]
Converts a raw file size from bytes to a more readable format.
pretty_filesize(filesize)pretty_filesize(filesize)
filesize |
file size in bytes |
Designed to be used in conjunction with the file.size() function in base R.
Presents in kilobytes, megabytes or gigabytes.
Shows as bytes until 1 KB, then kilobytes up to 1 MB, then megabytes until 1GB, then it will show as gigabytes for anything larger.
Rounds the end result to 2 decimal places.
Using base 10 (decimal), so 1024 bytes is 1,024 KB.
string containing prettified file size
Other prettying:
pretty_num(),
pretty_num_table(),
pretty_time(),
pretty_time_taken()
pretty_filesize(2) pretty_filesize(549302) pretty_filesize(9872948939) pretty_filesize(1) pretty_filesize(1000) pretty_filesize(1000^2) pretty_filesize(10^9)pretty_filesize(2) pretty_filesize(549302) pretty_filesize(9872948939) pretty_filesize(1) pretty_filesize(1000) pretty_filesize(1000^2) pretty_filesize(10^9)
Uses as.numeric() to force a numeric value and then formats prettily
for easy presentation in console messages, reports, or dashboards.
This rounds to 0 decimal places by default, and adds in comma separators.
Expect that this will commonly be used for adding the pound symbol, the percentage symbol, or to have a +/- prefixed based on the value.
If applying over multiple or unpredictable values and you want to preserve
a non-numeric symbol such as "x" or "c" for data not available, use the
ignore_na = TRUE argument to return those values unaffected.
If you want to customise what NA values are returned as, use the alt_na
argument.
This function silences the warning around NAs being introduced by coercion.
pretty_num( value, prefix = "", gbp = FALSE, suffix = "", dp = 0, ignore_na = FALSE, alt_na = FALSE, nsmall = NULL, dynamic_dp_value = NULL, abbreviate = TRUE )pretty_num( value, prefix = "", gbp = FALSE, suffix = "", dp = 0, ignore_na = FALSE, alt_na = FALSE, nsmall = NULL, dynamic_dp_value = NULL, abbreviate = TRUE )
value |
value to be prettified |
prefix |
prefix for the value, if "+/-" then it will automatically assign + or - based on the value |
gbp |
whether to add the pound symbol or not, defaults to not |
suffix |
suffix for the value, e.g. "%" |
dp |
number of decimal places to round to, 0 by default. |
ignore_na |
whether to skip function for strings that can't be converted and return original value |
alt_na |
alternative value to return in place of NA, e.g. "x" |
nsmall |
minimum number of digits to the right of the decimal point.
If NULL, the value of |
dynamic_dp_value |
Integer. Default = NULL.
Overrides the |
abbreviate |
whether to abbreviate large numbers to nearest million (where 1e6 <= value < 1e9) or billion (where value >= 1e9). |
string featuring prettified value
comma_sep() round_five_up() as.numeric()
Other prettying:
pretty_filesize(),
pretty_num_table(),
pretty_time(),
pretty_time_taken()
# On individual values pretty_num(5789, gbp = TRUE) pretty_num(564, prefix = "+/-") pretty_num(567812343223, gbp = TRUE, prefix = "+/-") pretty_num(11^9, gbp = TRUE, dp = 3) pretty_num(-11^8, gbp = TRUE, dp = -1) pretty_num(43.3, dp = 1, nsmall = 2) pretty_num("56.089", suffix = "%") pretty_num("x") pretty_num("x", ignore_na = TRUE) pretty_num("nope", alt_na = "x") pretty_num(7.8e9, abbreviate = FALSE) # dynamic_dp_value enabled for a billion value not divisible by 10 pretty_num(3e9, dynamic_dp_value = 2) # dynamic_dp_value enabled for a billion value divisible by 10 pretty_num(10e9, dynamic_dp_value = 2) # dynamic_dp_value enabled for a million value not divisible by 10 pretty_num(3e6, dynamic_dp_value = 3) # dynamic_dp_value enabled for a million value divisible by 10 pretty_num(10e6, dynamic_dp_value = 3) # dynamic_dp_value enabled with GBP and suffix pretty_num(1.5e9, gbp = TRUE, suffix = "%", dynamic_dp_value = 1 ) #' # Applied over an example vector vector <- c(3998098008, -123421421, "c", "x") pretty_num(vector) pretty_num(vector, prefix = "+/-", gbp = TRUE) # Return original values if NA pretty_num(vector, ignore_na = TRUE) # Return alternative value in place of NA pretty_num(vector, alt_na = "z")# On individual values pretty_num(5789, gbp = TRUE) pretty_num(564, prefix = "+/-") pretty_num(567812343223, gbp = TRUE, prefix = "+/-") pretty_num(11^9, gbp = TRUE, dp = 3) pretty_num(-11^8, gbp = TRUE, dp = -1) pretty_num(43.3, dp = 1, nsmall = 2) pretty_num("56.089", suffix = "%") pretty_num("x") pretty_num("x", ignore_na = TRUE) pretty_num("nope", alt_na = "x") pretty_num(7.8e9, abbreviate = FALSE) # dynamic_dp_value enabled for a billion value not divisible by 10 pretty_num(3e9, dynamic_dp_value = 2) # dynamic_dp_value enabled for a billion value divisible by 10 pretty_num(10e9, dynamic_dp_value = 2) # dynamic_dp_value enabled for a million value not divisible by 10 pretty_num(3e6, dynamic_dp_value = 3) # dynamic_dp_value enabled for a million value divisible by 10 pretty_num(10e6, dynamic_dp_value = 3) # dynamic_dp_value enabled with GBP and suffix pretty_num(1.5e9, gbp = TRUE, suffix = "%", dynamic_dp_value = 1 ) #' # Applied over an example vector vector <- c(3998098008, -123421421, "c", "x") pretty_num(vector) pretty_num(vector, prefix = "+/-", gbp = TRUE) # Return original values if NA pretty_num(vector, ignore_na = TRUE) # Return alternative value in place of NA pretty_num(vector, alt_na = "z")
dfeR::pretty_num().You can format number and character values in a data frame
by passing arguments to dfeR::pretty_num().
Use parameters include_columns or exclude_columns
to specify columns for formatting.
pretty_num_table(data, include_columns = NULL, exclude_columns = NULL, ...)pretty_num_table(data, include_columns = NULL, exclude_columns = NULL, ...)
data |
A data frame containing the columns to be formatted. |
include_columns |
A character vector specifying which columns to format.
If |
exclude_columns |
A character vector specifying columns to exclude
from formatting.
If |
... |
Additional arguments passed to |
The function first checks if any columns are specified for inclusion
via include_columns.
If none are provided, it checks if columns are specified for exclusion
via exclude_columns.
If neither is specified, all columns in the data frame are formatted.
A data frame with columns formatted using dfeR::pretty_num().
Other prettying:
pretty_filesize(),
pretty_num(),
pretty_time(),
pretty_time_taken()
# Example data frame df <- data.frame( a = c(1.234, 5.678, 9.1011), b = c(10.1112, 20.1314, 30.1516), c = c("A", "B", "C") ) # Apply formatting to all columns pretty_num_table(df, dp = 2) # Apply formatting to only selected columns pretty_num_table(df, include_columns = c("a"), dp = 2) # Apply formatting to all columns except specified ones pretty_num_table(df, exclude_columns = c("b"), dp = 2) # Apply formatting to all columns except specified ones and # provide alternative value for NAs pretty_num_table(df, alt_na = "[z]", exclude_columns = c("b"), dp = 2)# Example data frame df <- data.frame( a = c(1.234, 5.678, 9.1011), b = c(10.1112, 20.1314, 30.1516), c = c("A", "B", "C") ) # Apply formatting to all columns pretty_num_table(df, dp = 2) # Apply formatting to only selected columns pretty_num_table(df, include_columns = c("a"), dp = 2) # Apply formatting to all columns except specified ones pretty_num_table(df, exclude_columns = c("b"), dp = 2) # Apply formatting to all columns except specified ones and # provide alternative value for NAs pretty_num_table(df, alt_na = "[z]", exclude_columns = c("b"), dp = 2)
Convert seconds into a human readable format
pretty_time(seconds)pretty_time(seconds)
seconds |
number of seconds to prettify |
Recognises when to present as:
seconds
minutes and seconds
hours, minutes and seconds
It will show seconds until 119 seconds, then minutes until 119 minutes, then hours. It doesn't do days or higher yet, but could be adapted to do so if there's demand.
string containing the 'pretty' time
comma_sep() round_five_up() as.POSIXct()
Other prettying:
pretty_filesize(),
pretty_num(),
pretty_num_table(),
pretty_time_taken()
pretty_time(1) pretty_time(8) pretty_time(888) pretty_time(88888888) pretty_time(c(60, 2, 88, 88888888))pretty_time(1) pretty_time(8) pretty_time(888) pretty_time(88888888) pretty_time(c(60, 2, 88, 88888888))
Converts a start and end value to a readable time format.
pretty_time_taken(start_time, end_time)pretty_time_taken(start_time, end_time)
start_time |
start time readable by as.POSIXct |
end_time |
end time readable by as.POSIXct |
Designed to be used with Sys.time() when tracking start and end times.
Shows as seconds up until 119 seconds, then minutes until 119 minutes, then hours for anything larger.
Input start and end times must be convertible to POSIXct format.
string containing prettified elapsed time
comma_sep() round_five_up() as.POSIXct()
Other prettying:
pretty_filesize(),
pretty_num(),
pretty_num_table(),
pretty_time()
pretty_time_taken( "2024-03-23 07:05:53 GMT", "2024-03-23 12:09:56 GMT" ) # Track the start and end time of a process start <- Sys.time() Sys.sleep(0.1) end <- Sys.time() # Use this function to present it prettily pretty_time_taken(start, end)pretty_time_taken( "2024-03-23 07:05:53 GMT", "2024-03-23 12:09:56 GMT" ) # Track the start and end time of a process start <- Sys.time() Sys.sleep(0.1) end <- Sys.time() # Use this function to present it prettily pretty_time_taken(start, end)
A lookup of ONS geography region names and codes for England. In their lookups Northern Ireland, Scotland and Wales are regions.
regionsregions
regionsA data frame with 16 rows and 2 columns:
Region name
Region code
Also included inner and outer London county split as DfE frequently publish those as regions, as well as some custom DfE names and codes. This is used as the definitive list for the screening of open data before it is published by the DfE.
curated by [email protected], ONS codes sourced from https://geoportal.statistics.gov.uk/search?q=NAC_RGN
Round any number to a specified number of places, with 5's being rounded up.
round_five_up(number, dp = 0)round_five_up(number, dp = 0)
number |
number to be rounded |
dp |
number of decimal places to round to, default is 0 |
Rounds to 0 decimal places by default.
You can use a negative value for the decimal places. For example: -1 would round to the nearest 10 -2 would round to the nearest 100 and so on.
This is as an alternative to round in base R, which uses a bankers round. For more information see the round() documentation.
Rounded number
# No dp set round_five_up(2485.85) # With dp set round_five_up(2485.85, 2) round_five_up(2485.85, 1) round_five_up(2485.85, 0) round_five_up(2485.85, -1) round_five_up(2485.85, -2)# No dp set round_five_up(2485.85) # With dp set round_five_up(2485.85, 2) round_five_up(2485.85, 1) round_five_up(2485.85, 0) round_five_up(2485.85, -1) round_five_up(2485.85, -2)
Quick expansion to the message() function aimed for use in functions for
an easy addition of a global verbose TRUE / FALSE argument to toggle the
messages on or off
toggle_message(..., verbose)toggle_message(..., verbose)
... |
any message you would normally pass into |
verbose |
logical, usually a variable passed from the function you are using this within |
No return value, called for side effects
# Usually used in a function my_function <- function(count_fingers, verbose) { toggle_message("I have ", count_fingers, " fingers", verbose = verbose) fingers_thumbs <- count_fingers + 2 toggle_message("I have ", fingers_thumbs, " digits", verbose = verbose) } my_function(5, verbose = FALSE) my_function(5, verbose = TRUE) # Can be used in isolation toggle_message("I want the world to read this!", verbose = TRUE) toggle_message("I ain't gonna show this message!", verbose = FALSE) count_fingers <- 5 toggle_message("I have ", count_fingers, " fingers", verbose = TRUE)# Usually used in a function my_function <- function(count_fingers, verbose) { toggle_message("I have ", count_fingers, " fingers", verbose = verbose) fingers_thumbs <- count_fingers + 2 toggle_message("I have ", fingers_thumbs, " digits", verbose = verbose) } my_function(5, verbose = FALSE) my_function(5, verbose = TRUE) # Can be used in isolation toggle_message("I want the world to read this!", verbose = TRUE) toggle_message("I ain't gonna show this message!", verbose = FALSE) count_fingers <- 5 toggle_message("I have ", count_fingers, " fingers", verbose = TRUE)
wd_pcon_lad_la_rgn_ctrywd_pcon_lad_la_rgn_ctry
wd_pcon_lad_la_rgn_ctryA data frame with 24,629 rows and 14 columns:
First year in the lookups that we see this location
Last year in the lookups that we see this location
Ward name
Parliamentary constituency name
Local authority district name
Local authority name
Region name
Country name
9 digit ward code
9 digit westminster constituency code
9 digit local authority district code
old 3 digit local authority code
9 digit local authority code
9 digit region code
9 digit country code
wd_pcon_lad_la_rgn_ctry has been superseded by geo_hierarchy and will no
longer receive any updates. It will be removed in the next major version.
geo_hierarchy contains all the columns in this data set plus more, and is
what we will be maintaining moving forwards. wd_pcon_lad_la_rgn_ctry will
be removed in the next major release of the package.
A lookup showing the hierarchy of ward to Westminster parliamentary constituency to local authority district to local authority to region to country for years 2017, 2019, 2020, 2021, 2022, 2023 and 2024.
Changes we've made to the original lookup:
The original lookup from ONS uses the Upper Tier Local Authority, we then update this so that where there is a metropolitan local authority we use the local authority district as the local authority to match how DfE publish data for local authorities.
We have noticed that in the 2017 version, the Glasgow East constituency had a code of S1400030 instead of the usual S14000030, we've assumed this was an error and have change this in our data so that Glasgow East is S14000030 in 2017.
We have joined on regions using the Ward to LAD to County to Region file.
We have joined on countries based on the E / N / S / W at the start of codes.
Scotland had no published regions in 2017, so given the rest of the years have Scotland as the region, we've forced that in for 2017 too to complete the data set.
https://geoportal.statistics.gov.uk/search?tags=lup_wd_pcon_lad_utla https://geoportal.statistics.gov.uk/search?q=lup_wd_lad_cty_rgn_gor_ctry https://get-information-schools.service.gov.uk/Guidance/LaNameCodes and https://tinyurl.com/EESScreenerLAs
This function writes a large R data frame or tibble (df) to a Delta Lake
table (target_table) on Databricks using Databricks Volumes and the
COPY INTO SQL command.
write_df_to_delta( df, target_table, db_conn, column_types_schema = NULL, volume_dir, copy_options = "'mergeSchema' = 'true'", overwrite_table = FALSE, chunk_size_bytes = 5 * 1024^3 )write_df_to_delta( df, target_table, db_conn, column_types_schema = NULL, volume_dir, copy_options = "'mergeSchema' = 'true'", overwrite_table = FALSE, chunk_size_bytes = 5 * 1024^3 )
df |
A |
target_table |
A character string specifying the name of the Delta
table.
Can be unqualified ( |
db_conn |
A valid DBI connection object to Databricks. This connection is used to interact with the Delta table. |
column_types_schema |
An optional Arrow Schema object (created via
Type mapping reference (Arrow schema field → Spark SQL type):
Important notes for schema use:
|
volume_dir |
A character string specifying the path to the target Databricks Volume where the Parquet file will be uploaded. |
copy_options |
A character string specifying options for the
|
overwrite_table |
Logical; if |
chunk_size_bytes |
An integer specifying the size of each data chunk in bytes. This is used to split the data frame into smaller chunks for uploading. Defaults to 5GB. |
The function performs the following steps:
Optionally overwrites the target table.
Uploads the data as Parquet file(s) to a specified Databricks Volume.
Executes a COPY INTO command to load the file(s) into Delta Lake.
Deletes the temporary file(s) from the Volume after loading is complete.
Data is chunked into segments during upload to accommodate the Databricks REST API limit of 5 GB per single file upload.
Invisibly returns the result of the COPY INTO execution.
To use this function, users must ensure that they have appropriate Databricks permissions:
Catalog/schema: USE CATALOG on the target catalog
and USE SCHEMA on the target schema.
Table creation: CREATE TABLE on the target schema
(if overwrite_table = TRUE) or MODIFY and SELECT on
the existing table.
Volume access: READ VOLUME and WRITE VOLUME
on the Databricks Volume used for staging (specified in volume_dir).
Moreover, this function requires valid .Renviron variables for
authentication, specifically DATABRICKS_TOKEN and DATABRICKS_HOST.
Other databricks:
check_databricks_odbc()
## Not run: # Setup connection using environment variables con <- DBI::dbConnect(odbc::databricks(), httpPath = Sys.getenv("DATABRICKS_SQL_PATH") ) write_df_to_delta( df = my_data, target_table = "catalog.schema.my_table", db_conn = con, volume_dir = "/Volumes/catalog/schema", overwrite_table = TRUE ) ## End(Not run)## Not run: # Setup connection using environment variables con <- DBI::dbConnect(odbc::databricks(), httpPath = Sys.getenv("DATABRICKS_SQL_PATH") ) write_df_to_delta( df = my_data, target_table = "catalog.schema.my_table", db_conn = con, volume_dir = "/Volumes/catalog/schema", overwrite_table = TRUE ) ## End(Not run)
NA values in tablesReplaces NA values in tables except for ones in time and geography
columns that must be included in DfE official statistics.
Guidance on our Open Data Standards.
z_replace(data, replacement_alt = NULL, exclude_columns = NULL)z_replace(data, replacement_alt = NULL, exclude_columns = NULL)
data |
name of the table that you want to replace NA values in |
replacement_alt |
optional - if you want the NA replacement value to be different to "z" |
exclude_columns |
optional - additional columns to exclude from
NA replacement.
Column names that match ones found in |
Names of geography and time columns that are used in this function can be
found in dfeR::geog_time_identifiers.
table with "z" or an alternate replacement value instead of NA
values for columns that are not for time or geography.
# Create a table for the example df <- data.frame( time_period = c(2022, 2022, 2022), time_identifier = c("Calendar year", "Calendar year", "Calendar year"), geographic_level = c("National", "Regional", "Regional"), country_code = c("E92000001", "E92000001", "E92000001"), country_name = c("England", "England", "England"), region_code = c(NA, "E12000001", "E12000002"), region_name = c(NA, "North East", "North West"), mystery_count = c(42, 25, NA) ) z_replace(df) # Use a different replacement value z_replace(df, replacement_alt = "c")# Create a table for the example df <- data.frame( time_period = c(2022, 2022, 2022), time_identifier = c("Calendar year", "Calendar year", "Calendar year"), geographic_level = c("National", "Regional", "Regional"), country_code = c("E92000001", "E92000001", "E92000001"), country_name = c("England", "England", "England"), region_code = c(NA, "E12000001", "E12000002"), region_name = c(NA, "North East", "North West"), mystery_count = c(42, 25, NA) ) z_replace(df) # Use a different replacement value z_replace(df, replacement_alt = "c")