# pxmake version 12.1

(updated `r format(Sys.time(), '%Y-%m-%d')`)

This note will be presented at the annual PX-meeting in Skopje, North Macedonia, November 7th, 2024.

Title: *pxmake - multilingual px files with R, 2024*
# **PXWeb-logfiles**
Unfortunately, 'PXWeb-out-of-the-box' does not offer usage statistics of any kind. Yet most PXWeb-based services regularly need to deliver management information, as PXWeb is vital to the dissemination policy of many statistical offices. The good news is that, by default, PXWeb stores the needed information in logfiles.

The tricky part is, ironically, transferring that information to px-files. But as this example demonstrates, it is not all that difficult when using pxmake. To serve as learning material, the example uses R, free to all, to show how it can be done.
And do not forget: the easiest way to disseminate information is with PXWeb/PxWin.
**All code below runs in a project called 'log'.**
```{r packages_in_use, warning=FALSE, message=FALSE, echo=TRUE}
#| code-fold: true
#| code-summary: "click to see/hide list of packages used"

# Standard R packages
if (!require("tidyverse")) {install.packages("tidyverse")}
library(tidyverse)
if (!require("lubridate")) {install.packages("lubridate")}
library(lubridate)
if (!require("janitor")) {install.packages("janitor")}
library(janitor)
if (!require("dplyr")) {install.packages("dplyr")}
library(dplyr)
if (!require("RCurl")) {install.packages("RCurl")}
library(RCurl)
if (!require("arrow")) {install.packages("arrow")}
library(arrow)
if (!require("xml2")) {install.packages("xml2")}
library(xml2)
# To get subjects used in Statbank Greenland
if (!require("httr")) {install.packages("httr")}
library(httr)
if (!require("jsonlite")) {install.packages("jsonlite")}
library(jsonlite)

# Commands used to set up pxmake ---------------------------------------------
# install.packages('devtools')
# devtools::install_github('StatisticsGreenland/pxmake')
library(pxmake)
library(statgl)
```
## Log Types
PXWeb has three different log-files apart from the normal web-server logs. They are stored on the server in `~/PxWeb/logs` as (a small R sketch after the list shows how to check for them):

1) api.log\*
2) pxweb.log\*
3) stats.log\*
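A quick way to confirm the files are present is to list them from R. This is a minimal sketch, assuming the IIS path used later in this note (on the server itself, `~/PxWeb/logs` resolves differently):

```{r, list_logs, echo=TRUE, eval=FALSE}
# List PXWeb's three log families on the web-server
list.files("C:/inetpub/wwwroot/PxWeb/logs",
           pattern = "^(api|pxweb|stats).*\\.log", ignore.case = TRUE)
```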
### Data Preparation
If you do not have R installed on the web-server, different strategies can be followed:

1) Set permissions on the server to allow the log files to be read
2) Copy all log-files you want to analyse to a file-server
3) Concatenate all log-files of the same type into one .txt file before copying

Regarding 3: on a Windows server this can be done with the DOS copy command.
Log on the server, open a cmd prompt and enter the DOS copy commands:

```{r, concatinate_on_webserver, echo=TRUE, eval=FALSE}
cd c:/inetpub/wwwroot/PxWeb/logs/
copy /y stats*.* stats.txt
copy /y api*.* api.txt
copy /y pxweb*.* pxweb.txt

:: Concatenate all logs matching stats*.* into statsall.txt
copy /y "C:\inetpub\wwwroot\PxWeb\logs\stats*.*" "C:\inetpub\wwwroot\home\pxmake\statsall.txt"

:: Define source and destination paths
set "source=C:\inetpub\wwwroot\home\pxmake\statsall.txt"
set "destination=C:\inetpub\wwwroot\home\pxmake\statsall.zip"

:: Compress statsall.txt into statsall.zip using PowerShell
powershell -command "Compress-Archive -Path '%source%' -DestinationPath '%destination%' -Force"
echo Zip file created at %destination%
```
On https://bank.stat.gl a task was set up to run every night, concatenating the logs and zipping the result as statsall.zip in the pxmake subfolder.
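For reference, such a nightly task can be registered from an elevated command prompt with schtasks. This is a minimal sketch; the .bat file name is hypothetical and stands for a batch file holding the copy/compress commands above:

```{r, schedule_nightly, echo=TRUE, eval=FALSE}
:: Minimal sketch: run a (hypothetical) concat_logs.bat every night at 02:00
schtasks /Create /TN "PXWeb log concat" /TR "C:\inetpub\wwwroot\home\pxmake\concat_logs.bat" /SC DAILY /ST 02:00
```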
## stats.log
First let's look at **stats.log**, saved as one file per day. One row is added to the log-file for each table being extracted or manipulated by PXWeb.

Content example:

```{r, stats_content, echo=TRUE, eval=FALSE}
from http://bank.stat.gl:

a) 2024-07-19 06:43:01,028 [20] INFO PCAxis.Web.Controls.PxDefaultLogger -
b) Context=Selection,
c) UserId=userid,
d) Language=da,
e) Database=Greenland,
f) ActionType=Presentation,
g) ActionName=tableViewLayout1,
h) TableId=C:\inetpub\wwwroot\PxWeb\Resources\PX\Databases\Greenland\...UDXISCPROH.px,
i) NumberOfCells=3,
j) NumberOfContents=1

In the Greenlandic StatBank sample-files we find:

Context: Selection, Presentation
UserId: is not set
Language: en, da, kl
Database: Greenland, GSmicro
ActionType: Presentation, SaveAs, Operation
ActionName: tableViewSorted, tableViewLayout2, tableViewLayout1, sum, subtract, pivotManual, pivotCW, pivotCCW, perPart, informationView, footnoteView, FileTypeRelational, FileTypePX, FileTypeJsonStat, FileTypeJson, FileTypeHtml5Table, FileTypeHtml, FileTypeExcelXDoubleColumn, FileTypeExcelX, FileTypeExcelDoubleColumn, FileTypeExcel, FileTypeCsvWithoutHeadingAndTabulator, FileTypeCsvWithHeadingAndTabulator, FileTypeCsvWithHeadingAndSpace, FileTypeCsvWithHeadingAndSemiColon, FileTypeCsvWithHeadingAndComma, FileTypeChartPng, FileTypeChartJpeg, divide, deleteVariable, deleteValue, chartViewRadar, chartViewPopulationPyramid, chartViewPoint, chartViewPie, chartViewLine, chartViewColumnStacked100, chartViewColumnStacked, chartViewColumnLine, chartViewColumn, chartViewBarStacked100, chartViewBarStacked, chartViewBar, chartViewAreaStacked100, chartViewAreaStacked, chartViewArea, changeValueOrder, changeTextCodePresentation
TableId: UDXISCPROH.px
NumberOfCells
NumberOfContents
```

### Read and extract
Downloading and reading the log file can be time-consuming, so only run the next chunk when needed. If a zipped version already exists, it is less of a problem.
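A simple guard is to skip the download when a fresh local copy exists. This is a minimal sketch, using the same file names as the chunk below:

```{r, conditional_download, echo=TRUE, eval=FALSE}
# Download only if statsall.zip is missing or more than a day old
if (!file.exists("statsall.zip") ||
    difftime(Sys.time(), file.mtime("statsall.zip"), units = "days") > 1) {
  download.file("https://stat.gl/pxmake/statsall.zip", "statsall.zip", mode = "wb")
}
```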
The 'stat_log_data' dataframe is saved to disk in the parquet format for a minimal footprint. In this example both Rds (R dataset) and parquet are written to the filesystem; the parquet file is 5.4 MB, down from 69 MB as Rds.
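Once both files are written, the footprints can be compared directly. A minimal sketch, assuming the file names used in the chunk below:

```{r, compare_footprint, echo=TRUE, eval=FALSE}
# On-disk size of the two serialisations, in MB
round(c(rds = file.size("statlog.Rds"),
        parquet = file.size("statlog.parquet")) / 2^20, 1)
```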
As pxmake uses a variable called language, the language variable in the statlog file is renamed to 'Lang' (lang after clean_names()) in this step.
The output is too lengthy, so the chunk option eval is set to FALSE. This is the code; the result is saved to a parquet file for later use:
```{r read_stattxt, message=FALSE, warning=FALSE, echo=TRUE, eval=FALSE}
# URL and destination file path
url <- "https://stat.gl/pxmake/statsall.zip"
destfile <- "statsall.zip"

# Download and unzip the file
download.file(url, destfile, mode = "wb")
unzip("statsall.zip", exdir = ".")

log_data <- readLines("statsall.txt")

# Convert log_data into a data frame
log_df <- data.frame(log_entry = log_data, stringsAsFactors = FALSE)

# Extract relevant components using regular expressions
stat_log_data <- log_df %>%
  mutate(
    Timestamp = str_extract(log_entry, "^[^\\[]+"),
    LogLevel = str_extract(log_entry, "(?<=\\[)[^\\]]+"),
    Context = str_extract(log_entry, "(?<=Context=)[^,]+"),
    UserId = str_extract(log_entry, "(?<=UserId=)[^,]+"),
    Lang = str_extract(log_entry, "(?<=Language=)[^,]+"), # Language is used by pxmake
    Database = str_extract(log_entry, "(?<=Database=)[^,]+"),
    ActionType = str_extract(log_entry, "(?<=ActionType=)[^,]+"),
    ActionName = str_extract(log_entry, "(?<=ActionName=)[^,]+"),
    TableId = sub("\\.[pP][xX]$", "",
                  basename(str_extract(log_entry, "(?<=TableId=)[^,]+"))),
    NumberOfCells = as.integer(str_extract(log_entry, "(?<=NumberOfCells=)[0-9]+")),
    NumberOfContents = as.integer(str_extract(log_entry, "(?<=NumberOfContents=)[0-9]+"))
  ) %>%
  select(-log_entry) %>%
  clean_names()

write_rds(stat_log_data, "statlog.Rds")
write_parquet(stat_log_data, "statlog.parquet")
```
## api log
Coding with ChatGPT, starting from the prompt:

> i want load data from a log file into a dataframe in r. some lines have errors:

followed by a number of lines pasted from the log files.
An IP-to-country lookup from https://lite.ip2location.com/ was also used, and the resulting dataframe was saved to apilog.parquet.
```{r wrangle_apidata, message=FALSE, warning=FALSE, echo=TRUE, eval=FALSE}
#| code-fold: true
#| code-summary: "code"

# Step 1: Read in the log file
log_lines <- readLines("api.txt")

# Step 2: Keep lines that contain structured log information
log_data <- log_lines %>%
  # Filter lines with date pattern; modify if needed based on your log format
  grep("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}", ., value = TRUE) %>%
  # Create a data frame
  data.frame(line = ., stringsAsFactors = FALSE) %>%
  # Extract structured components
  mutate(
    datetime = str_extract(line, "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}"),
    thread_id = str_extract(line, "\\[\\d+\\]") %>% str_remove_all("\\[|\\]"),
    level = str_extract(line, "\\b(INFO|ERROR|WARN|DEBUG)\\b"),
    url = str_extract(line, "url=[^,]+") %>% str_remove("url="),
    type = str_extract(line, "type=[^,]+") %>% str_remove("type="),
    caller = str_extract(line, "caller=[^,]+") %>% str_remove("caller="),
    cached = str_extract(line, "cached=[^,]+") %>% str_remove("cached="),
    message = str_extract(line, "message=.*$") %>% str_remove("message=")
  ) %>%
  select(-line) # Remove the original line column

# IP-to-country lookup: https://lite.ip2location.com/
library(ip2location)

# Initialize the IP2Location database
ip_db <- "IP2LOCATION/IP2LOCATION-LITE-DB1.BIN"
ip_lookup <- ip2location::open(ip_db)

# The lookup works with a single IP address at a time
get_country_code <- function(ip) {
  result <- ip2location::get_all(ip) %>% as.data.frame()
  list(country_short = result$country_short, country_long = result$country_long)
}

# Use rowwise() to apply get_country_code to each row individually
l_data <- log_data %>%
  rowwise() %>%
  mutate(
    country_info = list(get_country_code(caller)),
    countrycode = country_info$country_short,
    countrytxt = country_info$country_long
  ) %>%
  ungroup() %>%
  select(-country_info) # Remove the intermediate list column

write_parquet(l_data, "apilog.parquet")
```
## pxweb log
```{r wrangle_pxwebdata, message=FALSE, warning=FALSE, echo=TRUE, eval=FALSE}
#| code-fold: true
#| code-summary: "code"

# Define the path to the log file
log_file <- "pxweb.txt"

# Define a function to parse log entries
parse_log_entry <- function(line) {
  # Regular expression to match the structured log format
  pattern <- "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) \\[(\\d+)\\] (\\w+)\\s+([\\w\\-]+) - (.+)$"

  # Attempt to match the pattern
  match <- str_match(line, pattern)

  # If a match is found, return a data frame row, else return NULL
  if (!is.na(match[1, 1])) {
    data.frame(
      timestamp = match[1, 2],
      thread_id = as.integer(match[1, 3]),
      log_level = match[1, 4],
      category = match[1, 5],
      details = match[1, 6],
      stringsAsFactors = FALSE
    )
  } else {
    NULL
  }
}

# Read and parse the log file
pxweb_log_data <- readLines(log_file) %>%
  # Apply the parsing function to each line
  lapply(parse_log_entry) %>%
  # Remove NULLs from the list
  compact() %>%
  # Bind all entries into a data frame
  bind_rows()

write_parquet(pxweb_log_data, "pxweblog.parquet")
```
# Let the fun begin
This section converts a dataframe to a px-object with minimal metadata. For maximum satisfaction, make sure PxWin is installed on your PC (Windows only).

A statlog.txt file has been uploaded to www.stat.gl/pxmake in parquet format, despite the .txt extension, to allow free play.
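To play along without server access, fetch the file and read it as parquet despite the .txt name. A minimal sketch, producing the statlog.parquet used in the chunk below:

```{r, fetch_demo_file, echo=TRUE, eval=FALSE}
# The .txt extension hides parquet content; save under a .parquet name and read it
download.file("https://www.stat.gl/pxmake/statlog.txt", "statlog.parquet", mode = "wb")
stat_log_data <- read_parquet("statlog.parquet")
```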
```{r stats_px, warning=FALSE, message=FALSE, echo=TRUE}
# Read the local log-parquet
stat_log_data <- read_parquet("statlog.parquet")

SBXSTAT <- stat_log_data %>%
  mutate(time = format(as.Date(timestamp), "%YM%m")) %>%
  filter(database == "Greenland") %>%
  mutate(subjectcode = substr(table_id, 1, 2)) %>%
  select(time, subjectcode, lang, action_type, action_name) %>%
  count(across(everything()))

px(SBXSTAT) %>%
  px_add_totals(c("subjectcode", "lang", "action_name", "action_type")) %>%
  px_save("SBXSTAT.px")

px(SBXSTAT) %>%
  px_elimination(tribble(~`variable-code`, ~elimination,
                         'subjectcode', 'YES',
                         'lang', 'Total',
                         "time", "YES")) %>%
  px_add_totals(c("lang", "action_name", "action_type")) %>%
  px_save("SBXSTAT.px")

# Read and view the first 12 lines of the resulting file
first_12_lines <- readLines("SBXSTAT.px", n = 12)
first_12_lines
```

```{r api_px, warning=FALSE, message=FALSE, echo=TRUE, eval=FALSE}
#| code-fold: true
#| code-summary: "code"

api_log_data <- read_parquet("apilog.parquet") %>%
  mutate(countrycode = ifelse(countrycode == "-", "GS", countrycode),
         countrytxt = ifelse(countrytxt == "-", "Statistics Greenland", countrytxt))

# Handle missing values and derive the px-file name from the url
api_data <- api_log_data %>%
  replace_na(list(url = "", type = "", caller = "", cached = "", message = "")) %>%
  mutate(time = format(as.Date(datetime), "%YM%m"),
         px = basename(url),
         px = sub("\\.px.*|\\.PX.*|\\.csv.*|\\.json.*|\\?.*", "", px)) %>%
  select(time, type, px, countrytxt, caller) %>%
  mutate(countrytxt = ifelse(countrytxt == "-", caller, countrytxt)) %>%
  select(time, type, px, countrytxt)

api <- api_data %>%
  filter(str_sub(px, 3, 3) == "X" &
           type %in% c("config", "data", "metadata", "error") &
           px != "" & px != "variables" & px != "Variables") %>%
  count(across(everything()))

px(api) %>%
  px_elimination(tribble(~`variable-code`, ~elimination,
                         'type', 'YES',
                         'px', 'YES',
                         'countrytxt', 'YES',
                         'time', 'YES')) %>%
  px_save("API.px")
```

```{r pxweb_px, warning=FALSE, message=FALSE, echo=TRUE, eval=FALSE}
#| code-fold: true
#| code-summary: "code"

pxweb_log_data <- read_parquet("pxweblog.parquet")

# Handle missing values and aggregate to monthly data
pxweb_data <- pxweb_log_data %>%
  replace_na(list(url = "", type = "", caller = "", cached = "", message = "")) %>%
  mutate(time = format(as.Date(timestamp), "%YM%m")) %>%
  select(time, log_level, category, details)

pxweb <- pxweb_data %>%
  select(time, log_level, category) %>%
  count(across(everything()))

px(pxweb) %>%
  px_elimination(tribble(~`variable-code`, ~elimination,
                         'log_level', 'YES',
                         'category', 'YES',
                         "time", "YES")) %>%
  px_save("PXWEBLOG.px")
```
## Metadata

A PX-file will need a lot more metadata to make sense to users, and in multiple languages.
**Texts instead of codes**
StatBank Greenland names px-files with a 2-character prefix for subjects:

AR - ARbejdsmarked (labour market)
BE - BEfolkning (population)
etc...

All subjects/subject codes can be obtained via the API:
```{r subjects, message=FALSE, warning=FALSE, echo=TRUE}
# Function to fetch data from the API and add a language column
fetch_data <- function(url, language) {
  response <- GET(url)
  data <- content(response, as = "text")
  json_data <- fromJSON(data, flatten = TRUE) %>% as_tibble()
  json_data <- json_data %>% mutate(language = language)
  return(json_data)
}

# URLs for the APIs
url_en <- "https://bank.stat.gl/api/v1/en/Greenland"
url_da <- "https://bank.stat.gl/api/v1/da/Greenland"
url_kl <- "https://bank.stat.gl/api/v1/kl/Greenland"

# Fetch data from each API
df_en <- fetch_data(url_en, 'en')
df_da <- fetch_data(url_da, 'da')
df_kl <- fetch_data(url_kl, 'kl')

# Combine the dataframes
subjectcode_text <- bind_rows(df_en, df_da, df_kl) %>% select(-type)

subjectcode_text %>%
  sample_n(5) %>%
  statgl_table()
```

**Adding in code**

```{r metadata, warning=FALSE, message=FALSE, echo=TRUE}
description <- tribble(
  ~language, ~value,
  "en", str_glue("Usage statistics 1999-{year(Sys.Date())} <em>[SBESTAT01]</em>"),
  "da", str_glue("Antal udtræk per emne 2020-{year(Sys.Date())} <em>[SBDSTAT01]</em>"),
  "kl", str_glue("Tabelimit ataatsimit emne 2020-{year(Sys.Date())} <em>[SBNSTAT01]</em>")
)

content_txt <- tribble(
  ~language, ~value,
  "en", "Usage statistics",
  "da", "Antal udtræk per emne",
  "kl", "Tabelimit ataatsimit emne"
)

subjects <- subjectcode_text %>%
  mutate(`variable-code` = "subjectcode", code = id, value = text) %>%
  select(`variable-code`, code, language, value)

prefix_values <- tibble::tribble(
  ~`variable-code`, ~code, ~language, ~value, ~order,
  "lang", "en", "en", "English", 1,
  "lang", "da", "en", "Danish", 2,
  "lang", "kl", "en", "Greenlandic", 3,
  "lang", "en", "da", "Engelsk", NA,
  "lang", "da", "da", "Dansk", NA,
  "lang", "kl", "da", "Grønlandsk", NA,
  "lang", "en", "kl", "Tuluttut", NA,
  "lang", "da", "kl", "Qallunaatut", NA,
  "lang", "kl", "kl", "Kalaallisut", NA
) %>%
  bind_rows(subjects)

variable_label <- tribble(
  ~`variable-code`, ~language, ~`variable-label`,
  'subjectcode', 'en', 'Subject',
  'subjectcode', 'da', 'Emne',
  'subjectcode', 'kl', 'Qineq',
  'lang', 'en', 'Language',
  'lang', 'da', 'Sprog',
  'lang', 'kl', 'Killiffigassaq'
)
```

## Data for a Table

```{r wrangle_sample, message=FALSE, warning=FALSE, echo=TRUE}
SBXSTAT %>%
  sample_n(6) %>%
  statgl_table()
```

## Add to a px-object

Here metadata is added, either by keywords or as previously defined.

```{r pxmake, warning=FALSE, message=FALSE, echo=TRUE}
creation_date <- "20240301 09:00"
last_updated <- "20240711 09:00"
next_update <- "20250301 09:00"

x <- SBXSTAT %>%
  pxmake::px() %>%
  px_codepage("utf-8") %>%
  px_language("en") %>%
  px_languages(c("en", "da", "kl")) %>%
  px_decimals("0") %>%
  px_showdecimals("0") %>%
  px_matrix("SBXSTAT01") %>%
  px_subject_code("SB") %>%
  px_subject_area(tribble(~language, ~value,
                          "en", "Statbank Greenland",
                          "da", "Statistikbank",
                          "kl", "Kisitsisaataasivik")) %>%
  px_contents(content_txt) %>%
  px_stub(c("subjectcode")) %>%
  px_heading(c("time", "lang")) %>%
  px_values(prefix_values %>% select(-order)) %>%
  px_variable_label(variable_label) %>%
  px_timeval("time") %>%
  px_description(description) %>%
  px_title(content_txt) %>%
  px_units(tribble(~language, ~value,
                   "en", "number of table extracts",
                   "da", "antal tabeludtræk",
                   "kl", "tabeluttræk")) %>%
  px_update_frequency("When needed") %>%
  px_creation_date(creation_date) %>%
  px_last_updated(last_updated) %>%
  px_next_update(next_update) %>%
  px_contact("Lars Pedersen, LARP at STAT.gl") %>%
  px_source(tribble(~language, ~value,
                    "en", "Statistics Greenland",
                    "da", "Grønlands Statistik",
                    "kl", "Kalaallit Nunaanni Naatsorsueqqissaartarfik")) %>%
  px_add_totals(c("lang", "subjectcode")) %>%
  px_elimination(tribble(~`variable-code`, ~elimination,
                         'subjectcode', 'Total',
                         'lang', 'Total'))
```
## Read and Write a px-file

px_save() saves the px-object as .px ... or .xlsx. Yes, the object can also be saved to Excel, edited there, converted back to a px-object, and finally saved to px with the changes made in Excel.

pxmake can read a .px or an .xlsx into a px-object using px().

If the dataframe has more than 1 million rows (the Excel limit), data can be stored as parquet or Rds files instead; see the sketch after the chunk below.
```{r savenread, warning=FALSE, message=FALSE, echo=TRUE}
px_save(x, "stat_test.px")
px_save(x, "stat_test.xlsx")

# y <- px("stat_test.xlsx")
# z <- px("stat_test.px")
```
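For dataframes beyond the Excel row limit, one option is to keep the data itself in parquet and rebuild the px-object from it. A minimal sketch; the parquet file name is hypothetical:

```{r, parquet_roundtrip, echo=TRUE, eval=FALSE}
# Parquet has no 1-million-row ceiling: store the data frame ...
write_parquet(SBXSTAT, "sbxstat_data.parquet")
# ... and rebuild a px-object from it when needed
x_big <- px(read_parquet("sbxstat_data.parquet"))
```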
To learn more about pxmake, read the articles on <https://statisticsgreenland.github.io/pxmake/>.

# For Researchers

... or just for further analysis: the dataframe 'statlog' can be transformed into a px-file for each selected variable. Footnotes in the px-files can be used to present additional information on the data. The px-files will be stored in a folder, along with the Alias text files needed for presentation in PXWeb.
In statlog, the variables log_level and user_id do not hold usable information, so they are excluded.
Timestamp is too detailed, so data is aggregated to monthly values before presentation, by:

time = format(as.Date(timestamp), "%YM%m")

or biannually as:

time = ifelse(month(as.Date(timestamp)) <= 6, paste0(year(as.Date(timestamp)), "H1"), paste0(year(as.Date(timestamp)), "H2"))
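Both expressions can be checked on a single date. A minimal sketch, using lubridate's month() and year() loaded earlier:

```{r, time_agg_check, echo=TRUE, eval=FALSE}
ts <- as.Date("2024-07-19")
format(ts, "%YM%m")            # monthly:  "2024M07"
ifelse(month(ts) <= 6,
       paste0(year(ts), "H1"),
       paste0(year(ts), "H2")) # biannual: "2024H2"
```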
This presentation allows researchers to learn what to expect and what the variables' distributions look like. They can then request access at the most detailed level.

The px-files hold a frequency count of each variable's unique code set, plus the knowledge needed to work with the data.

```{r researchers, warning=FALSE, message=FALSE, echo=TRUE}
# Biannual
statlogh <- stat_log_data %>%
  mutate(time = ifelse(month(as.Date(timestamp)) <= 6,
                       paste0(year(as.Date(timestamp)), "H1"),
                       paste0(year(as.Date(timestamp)), "H2"))) %>%
  select(time, context, lang, table_id, action_type, action_name)

# Monthly
statlogm <- stat_log_data %>%
  mutate(time = format(as.Date(timestamp), "%YM%m")) %>%
  select(time, context, lang, table_id, action_type, action_name)

y <- px(statlogh)
px_heading(y)

y <- px(statlogh) %>%
  px_stub("action_type") %>%
  px_heading("time")

individual_keywords_ml <- tibble::tribble(
  ~variable, ~language, ~px_description, ~px_matrix,
  "context", "en", "context", "context",
  "context", "da", "context", NA,
  "context", "kl", "context", NA,
  "lang", "en", "lang", "lang",
  "lang", "da", "lang", NA,
  "lang", "kl", "lang", NA,
  "table_id", "en", "table_id", "table_id",
  "table_id", "da", "table_id", NA,
  "table_id", "kl", "table_id", NA,
  "action_type", "en", "Action type", "actiont",
  "action_type", "da", "Action type", NA,
  "action_type", "kl", "Action type", NA,
  "action_name", "en", "Action name", "actionn",
  "action_name", "da", "Action name", NA,
  "action_name", "kl", "Action name", NA
)

z <- y %>%
  px_codepage("utf-8") %>%
  px_language("en") %>%
  px_languages(c("en", "da", "kl")) %>%
  px_decimals("0") %>%
  px_showdecimals("0") %>%
  px_subject_code("SB") %>%
  px_subject_area(tribble(~language, ~value,
                          "en", "Statbank Greenland",
                          "da", "Statistikbank",
                          "kl", "Kisitsisaataasivik")) %>%
  px_contents(content_txt) %>%
  px_timeval("time") %>%
  px_description(description) %>%
  px_title(content_txt) %>%
  px_units(tribble(~language, ~value,
                   "en", "number of table extracts",
                   "da", "antal tabeludtræk",
                   "kl", "tabeluttræk")) %>%
  px_update_frequency("When needed") %>%
  px_creation_date(creation_date) %>%
  px_last_updated(last_updated) %>%
  px_next_update(next_update) %>%
  px_contact("Lars Pedersen, LARP at STAT.gl") %>%
  px_autopen('YES')

unlink("SB", recursive = TRUE)
dir.create("SB")

px_micro(z, out_dir = "SB", keyword_values = individual_keywords_ml)
```