I am trying to replicate, in R, the first analysis demonstrated in Python in this Medium article, namely estimating the tail index alpha. The datasets used are here; the link is also in the article.
I am using the HTailIndex() function from the ExtremeRisks package, simply because it was the only option I found when searching with Bing and Brave. To keep things simple, I will only share my code for estimating the tail index alpha for one of the datasets, the Medium traffic dataset. Here is my setup code before estimation:
# import necessary libraries
library(tidyverse)
library(ExtremeRisks)
### Load the data
medium <- read_csv("medium-followers.csv", col_names = TRUE)
> ### Load the data
> read_csv("medium-followers.csv", col_names = TRUE)
Rows: 37 Columns: 2
── Column specification ─────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): followers_gained
date (1): period_start
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 37 × 2
   period_start followers_gained
   <date>                  <dbl>
 1 2023-11-01                111
 2 2023-10-01               1783
 3 2023-09-01                760
 4 2023-08-01                171
 5 2023-07-01                165
 6 2023-06-01                131
 7 2023-05-01                 55
 8 2023-04-01                 48
 9 2023-03-01                109
10 2023-02-01                 79
# ℹ 27 more rows
# ℹ Use `print(n = ...)` to see more rows
This is what happened the first time I tried to use HTailIndex() to estimate the tail index alpha of the distribution of the Medium data. Note that I am not trying to calculate a single alpha value here. Instead, I compute a range of alpha estimates for different values of k, and then plan to plot the estimates against k to decide which k to use, which is standard practice in this context:
### 1. Tail Index
> sample_size_medium <- length(medium$followers_gained)
> sample_size_medium
[1] 37
> k_values_medium <- round(seq(0.05 * sample_size_medium, 0.10 * sample_size_medium, length.out = 10))
> k_values_medium
[1] 2 2 2 2 3 3 3 3 3 4
> tail_indices_medium <- sapply(k_values_medium, function(k) HTailIndex(medium$followers_gained, k))
> # Plotting the tail index against k values
> plot(k_values_medium, tail_indices_medium, type = "b", xlab = "k", ylab = "Tail Index")
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
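To narrow down what plot() is complaining about, here is a minimal sketch of one way the two lengths could end up different. It assumes the estimator returns a list with several elements rather than a single number. I have not confirmed what HTailIndex() actually returns, so `fake_tail_index()` and its element names are hypothetical stand-ins used only to show how sapply() behaves in that situation.

```r
# Hypothetical stand-in for an estimator that returns a list instead of a
# single number. The element names are made up for illustration only.
fake_tail_index <- function(x, k) {
  list(estimate = 1 / k, variance = NA_real_, interval = c(NA_real_, NA_real_))
}

x <- c(111, 1783, 760, 171, 165, 131, 55, 48, 109, 79)
k_values <- c(2, 2, 2, 2, 3, 3, 3, 3, 3, 4)

# sapply() cannot simplify three-element results to a vector of length 10;
# it builds a 3 x 10 list-matrix instead (one column per k).
res <- sapply(k_values, function(k) fake_tail_index(x, k))
dim(res)
length(res)   # 30 entries, which no longer matches length(k_values)

# Extracting only the element of interest restores one value per k.
est <- sapply(k_values, function(k) fake_tail_index(x, k)$estimate)
length(est) == length(k_values)
```

Running `str(HTailIndex(medium$followers_gained, 3))` on a single k should show whether the real function behaves like this stand-in.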
I assume the problem is that HTailIndex() fails to return an estimate for one or more values of k, so I altered my code to use error handling that ensures exactly one value is kept for each k. This is what I came up with:
# same as before for the first 2 commands
> sample_size_medium <- length(medium$followers_gained)
> k_values_medium <- round(seq(0.05 * sample_size_medium, 0.10 * sample_size_medium, length.out = 10))
> tail_indices_medium <- sapply(k_values_medium, function(k) {
+ tryCatch({
+ ti <- HTailIndex(medium$followers_gained, k)
+ if(length(ti) == 1) return(ti) else return(NA)
+ }, error = function(e) NA)
+ })
> # Ensure both vectors have the same length
> if(length(tail_indices_medium) == length(k_values_medium)) {
+ # Plotting the tail index against k values
+ plot(k_values_medium, tail_indices_medium, type = "b", xlab = "k", ylab = "Tail Index")
+ } else {
+ print("Mismatch in lengths of k_values and tail_indices vectors")
+ }
Error in plot.window(...) : need finite 'ylim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
This version wraps the call in tryCatch() so that NA is returned whenever HTailIndex() throws an error for a value of k, or returns something that is not of length 1. When the plot failed, the first thing I did was check for NAs, and this is what I found:
> print(tail_indices_medium)
[1] NA NA NA NA NA NA NA NA NA NA
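An all-NA result does not by itself show that every call raised an error, because the `length(ti) == 1` guard also yields NA when a call succeeds but returns more than one element. The sketch below illustrates this with a hypothetical stand-in, `fake_tail_index()`, which is not the real function, and then shows one way to record what actually happened for each k.

```r
# Hypothetical stand-in: never errors, but returns three elements per call.
fake_tail_index <- function(x, k) {
  list(estimate = 1 / k, variance = 0, interval = c(0, 1))
}

k_values <- c(2, 3, 4)

# Same guard as in my code: anything that is not of length 1 becomes NA.
guarded <- sapply(k_values, function(k) {
  tryCatch({
    ti <- fake_tail_index(NULL, k)
    if (length(ti) == 1) ti else NA
  }, error = function(e) NA)
})
guarded   # all NA, although no call raised an error

# Recording a short description separates real errors from rejected results.
checked <- sapply(k_values, function(k) {
  tryCatch({
    ti <- fake_tail_index(NULL, k)
    paste("returned", length(ti), "elements of class", class(ti))
  }, error = function(e) paste("error:", conditionMessage(e)))
})
checked
```

Swapping the stand-in for the real HTailIndex() call in the second loop should reveal whether the NAs come from errors or from the length check.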
I don't know where to go from here.
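As a cross-check that does not depend on any package, the Hill estimator can be written directly in base R. This is only a sketch of the textbook formula, not necessarily what HTailIndex() or the Medium article computes. The helper name `hill_alpha()` is my own, and the data are a synthetic Pareto sample rather than the Medium dataset. The Hill estimator conventionally estimates gamma = 1/alpha, so the function returns the reciprocal.

```r
# Hill estimate of alpha based on the k largest observations.
hill_alpha <- function(x, k) {
  x <- sort(x[x > 0], decreasing = TRUE)   # logs require positive values
  stopifnot(k >= 1, k < length(x))
  # gamma_hat = mean log-excess of the top k values over the (k+1)-th largest
  gamma_hat <- mean(log(x[1:k])) - log(x[k + 1])
  1 / gamma_hat
}

set.seed(1)
# Synthetic Pareto sample with true alpha = 1.5, via the inverse-CDF method
x <- runif(500)^(-1 / 1.5)

k_values <- seq(10, 100, by = 10)
alpha_hat <- sapply(k_values, function(k) hill_alpha(x, k))

# One estimate per k, so the two vectors can be plotted against each other
if (interactive()) {
  plot(k_values, alpha_hat, type = "b", xlab = "k", ylab = "Estimated alpha")
}
```

With only 37 observations, a grid such as `unique(k_values_medium)` would avoid computing the same k several times, although estimates based on two to four order statistics are likely to be very noisy.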