The goal of this post is to show how to count the number of unique Google Summer of Code (GSOC) students I have mentored. The data source is my Teaching web page, whose markdown source is stored in the R string below:

projects.string <- "
I have mentored the
following students in coding free/open-source software.
- [Arthur Pan](https://github.com/ampurrr), 2023, polars in R.
- [Jocelyne Chen](https://github.com/ampurr), 2023, [animint2
  documentation and bug fixes](https://gsoc.ampurr.com).
- [Yufan Fei](https://github.com/Faye-yufan), 2023, animint2:
  interactive grammar of graphics.
- [Yufan Fei](https://github.com/Faye-yufan), 2022, [animint2: interactive
  grammar of
  graphics](https://github.com/Faye-yufan/gsoc22-animint/blob/master/README.md).
- [Fabrizio Sandri](https://github.com/FabrizioSandri), 2022,
  [RcppDeepState: github action for fuzz testing C++ code in R
  packages](https://fabriziosandri.github.io/gsoc-2022-blog/summary/2022/09/08/gsoc-summary.html).
- [Daniel Agyapong](https://github.com/EngineerDanny), 2022, [Rperform
  github action for performance testing R
  packages](https://engineerdanny.github.io/GSOC22-RPerform-Blog/).
- [Anirban Chetia](https://github.com/Anirban166), 2021, [directlabels
  improvements](https://github.com/Anirban166/directlabels).
- [Diego Urgell](https://github.com/diego-urgell), 2021,
  [BinSeg](https://github.com/diego-urgell/BinSeg) efficient C++
  implementation of binary segmentation.
- [Mark Nawar](https://github.com/Mark-Nawar), 2021, [re2r back on CRAN](https://github.com/rstats-gsoc/gsoc2021/wiki/re2r-back-on-CRAN)
- [Sanchit Saini](https://github.com/sanchit-saini), 2020,
  [rtracklayer R package
  improvements](https://github.com/rstats-gsoc/gsoc2020/wiki/rtracklayer-improvements).
- [Himanshu Singh](https://github.com/lazycipher), 2020, [animint2:
  interactive grammar of
  graphics](https://github.com/tdhock/animint2).
- [Julian Stanley](https://github.com/julianstanley), 2020, [Graphical
  User Interface for gfpop R
  package](https://github.com/julianstanley/gfpop-gui).
- [Anirban Chetia](https://github.com/Anirban166), 2020,
  [testComplexity R
  package](https://github.com/Anirban166/testComplexity).
- [Anuraag Srivastava](https://github.com/as4378), 2019, [Optimal
  Partitioning algorithm and opart R
  package](https://github.com/as4378/opart).
- [Avinash Barnwal](https://github.com/avinashbarnwal/), 2019, [AFT
  and Binomial loss functions for
  xgboost](https://github.com/avinashbarnwal/GSOC-2019).
- [Aditya Sam](https://github.com/theadityasam/), 2019, [Elastic net regularized interval regression and iregnet R package](https://theadityasam.github.io/GSOC2019/).
- [Alan Williams](https://github.com/aw1231), 2018,
  [SegAnnDB: machine learning system for DNA copy number analysis](https://github.com/tdhock/SegAnnDB),
  [blog](https://medium.com/alans-gsoc-blog/work-product-a1080d175160).
- [Vivek Kumar](https://github.com/vivekktiwari), 2018,
  [animint2: interactive grammar of graphics](https://github.com/tdhock/animint2),
  [blog](https://vivekktiwari.github.io/gsoc18/).
- [Johan Larsson](https://github.com/jolars), 2018,
  [sgdnet: SAGA algorithm for sparse linear models](https://github.com/jolars/sgdnet).
- [Marlin Na](https://github.com/Marlin-Na), 2017,
  [TnT: interactive genome browser](https://github.com/Marlin-Na/TnT).
- [Rover Van](https://github.com/RoverVan), 2017, [iregnet: regularized interval regression](https://github.com/anujkhare/iregnet).
- [Abhishek Shrivastava](https://github.com/abstatic), 2016,
  [SegAnnDB: interactive system for labeling and machine learning in genomic data](https://github.com/tdhock/SegAnnDB).
- [Faizan Khan](https://github.com/faizan-khan-iit), 2016--2017, [animint: interactive grammar of graphics](https://github.com/tdhock/animint).
- [Anuj Khare](https://github.com/anujkhare), 2016, [iregnet: regularized interval regression](https://github.com/anujkhare/iregnet).
- [Qin Wenfeng](https://github.com/qinwf), 2016, [re2r: regular expressions](https://github.com/qinwf/re2r).
- [Akash Tandon](https://github.com/analyticalmonk), 2016, [Rperform: performance testing R packages](https://github.com/analyticalmonk/Rperform).
- [Ishmael Belghazi](https://github.com/IshmaelBelghazi), 2015, [bigoptim: stochastic average gradient algorithm](https://github.com/IshmaelBelghazi/bigoptim).
- [Kevin Ferris](https://github.com/kferris10), 2015, [animint: interactive grammar of graphics](https://github.com/tdhock/animint).
- [Tony Tsai](https://github.com/caijun), 2015, [animint: interactive grammar of graphics](https://github.com/tdhock/animint).
- [Carson Sievert](https://github.com/cpsievert), 2014, [animint: interactive grammar of graphics](https://github.com/tdhock/animint).
- [Susan VanderPlas](https://github.com/srvanderplas), 2013, [animint: interactive grammar of graphics](https://github.com/tdhock/animint).
"

The markdown code above has a regular structure: each entry starts with newline, dash, space, open square bracket, then the student name. Later on the same line there is a comma, space, then the year of the GSOC project. We can convert the text string above to a data table with columns for name and year, using the regular expression R code below.

(projects.dt <- nc::capture_all_str(
  projects.string,
  "\n- \\[",
  name="[^]]+",
  ".*?, ",
  year="[0-9]+", as.integer))
##                     name  year
##                   <char> <int>
##  1:           Arthur Pan  2023
##  2:        Jocelyne Chen  2023
##  3:            Yufan Fei  2023
##  4:            Yufan Fei  2022
##  5:      Fabrizio Sandri  2022
##  6:      Daniel Agyapong  2022
##  7:       Anirban Chetia  2021
##  8:         Diego Urgell  2021
##  9:           Mark Nawar  2021
## 10:        Sanchit Saini  2020
## 11:       Himanshu Singh  2020
## 12:       Julian Stanley  2020
## 13:       Anirban Chetia  2020
## 14:   Anuraag Srivastava  2019
## 15:      Avinash Barnwal  2019
## 16:           Aditya Sam  2019
## 17:        Alan Williams  2018
## 18:          Vivek Kumar  2018
## 19:        Johan Larsson  2018
## 20:            Marlin Na  2017
## 21:            Rover Van  2017
## 22: Abhishek Shrivastava  2016
## 23:          Faizan Khan  2016
## 24:           Anuj Khare  2016
## 25:          Qin Wenfeng  2016
## 26:         Akash Tandon  2016
## 27:     Ishmael Belghazi  2015
## 28:         Kevin Ferris  2015
## 29:            Tony Tsai  2015
## 30:       Carson Sievert  2014
## 31:     Susan VanderPlas  2013
##                     name  year
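To see what the pattern does without the nc package, here is a minimal base R sketch of the same idea. The two-entry `toy.string` and the names `pattern`, `matches`, and `toy.df` are made up for illustration; the un-named capture groups stand in for the `name` and `year` arguments above.

```r
# Hypothetical two-entry string with the same structure as projects.string.
toy.string <- "
- [Ada Example](https://example.com/ada), 2021, [project
  one](https://example.com/one).
- [Bob Example](https://example.com/bob), 2022, project two.
"
# Group 1 is the name (anything but a closing bracket),
# group 2 is the first run of digits after a comma and a space.
pattern <- "\n- \\[([^]]+).*?, ([0-9]+)"
# Extract every match in the string, then pull each group out with sub().
matches <- regmatches(
  toy.string, gregexpr(pattern, toy.string, perl=TRUE))[[1]]
toy.df <- data.frame(
  name=sub(pattern, "\\1", matches, perl=TRUE),
  year=as.integer(sub(pattern, "\\2", matches, perl=TRUE)),
  stringsAsFactors=FALSE)
toy.df
```

The nc version is shorter because it names the groups and applies the type conversion in a single call.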

One way to get one row per unique student is to use the by argument inside the data table square brackets, as below:

projects.dt[, .(projects=.N), by=.(student=name)]
##                  student projects
##                   <char>    <int>
##  1:           Arthur Pan        1
##  2:        Jocelyne Chen        1
##  3:            Yufan Fei        2
##  4:      Fabrizio Sandri        1
##  5:      Daniel Agyapong        1
##  6:       Anirban Chetia        2
##  7:         Diego Urgell        1
##  8:           Mark Nawar        1
##  9:        Sanchit Saini        1
## 10:       Himanshu Singh        1
## 11:       Julian Stanley        1
## 12:   Anuraag Srivastava        1
## 13:      Avinash Barnwal        1
## 14:           Aditya Sam        1
## 15:        Alan Williams        1
## 16:          Vivek Kumar        1
## 17:        Johan Larsson        1
## 18:            Marlin Na        1
## 19:            Rover Van        1
## 20: Abhishek Shrivastava        1
## 21:          Faizan Khan        1
## 22:           Anuj Khare        1
## 23:          Qin Wenfeng        1
## 24:         Akash Tandon        1
## 25:     Ishmael Belghazi        1
## 26:         Kevin Ferris        1
## 27:            Tony Tsai        1
## 28:       Carson Sievert        1
## 29:     Susan VanderPlas        1
##                  student projects
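If only the count is needed, rather than the table of students, a shorter route is possible. The sketch below uses a small made-up `toy.dt` in place of `projects.dt` so that it runs on its own; the same calls should apply to `projects.dt$name`.

```r
library(data.table)
# Hypothetical stand-in for projects.dt: three students, four projects.
toy.dt <- data.table(
  name=c("Student A", "Student B", "Student A", "Student C"),
  year=c(2021L, 2021L, 2020L, 2019L))
# Number of distinct students, two equivalent ways.
n.students <- uniqueN(toy.dt$name)
n.students.base <- length(unique(toy.dt$name))
# Students with more than one project.
repeat.dt <- toy.dt[, .(projects=.N), by=.(student=name)][projects > 1]
n.students
repeat.dt
```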

Another way to do that is to use dcast, as shown below:

data.table::dcast(
  projects.dt,
  name ~ .,
  list(length, min, max),
  value.var="year"
)[, year_range := ifelse(
  year_min==year_max,
  year_min,
  sprintf("%d-%d", year_min, year_max)
)][order(year_range)]
## Warning in dcast.data.table(projects.dt, name ~ ., list(length, min, max), :
## NAs introduced by coercion to integer range

## Warning in dcast.data.table(projects.dt, name ~ ., list(length, min, max), :
## NAs introduced by coercion to integer range
##                     name year_length year_min year_max year_range
##                   <char>       <int>    <int>    <int>     <char>
##  1:     Susan VanderPlas           1     2013     2013       2013
##  2:       Carson Sievert           1     2014     2014       2014
##  3:     Ishmael Belghazi           1     2015     2015       2015
##  4:         Kevin Ferris           1     2015     2015       2015
##  5:            Tony Tsai           1     2015     2015       2015
##  6: Abhishek Shrivastava           1     2016     2016       2016
##  7:         Akash Tandon           1     2016     2016       2016
##  8:           Anuj Khare           1     2016     2016       2016
##  9:          Faizan Khan           1     2016     2016       2016
## 10:          Qin Wenfeng           1     2016     2016       2016
## 11:            Marlin Na           1     2017     2017       2017
## 12:            Rover Van           1     2017     2017       2017
## 13:        Alan Williams           1     2018     2018       2018
## 14:        Johan Larsson           1     2018     2018       2018
## 15:          Vivek Kumar           1     2018     2018       2018
## 16:           Aditya Sam           1     2019     2019       2019
## 17:   Anuraag Srivastava           1     2019     2019       2019
## 18:      Avinash Barnwal           1     2019     2019       2019
## 19:       Himanshu Singh           1     2020     2020       2020
## 20:       Julian Stanley           1     2020     2020       2020
## 21:        Sanchit Saini           1     2020     2020       2020
## 22:       Anirban Chetia           2     2020     2021  2020-2021
## 23:         Diego Urgell           1     2021     2021       2021
## 24:           Mark Nawar           1     2021     2021       2021
## 25:      Daniel Agyapong           1     2022     2022       2022
## 26:      Fabrizio Sandri           1     2022     2022       2022
## 27:            Yufan Fei           2     2022     2023  2022-2023
## 28:           Arthur Pan           1     2023     2023       2023
## 29:        Jocelyne Chen           1     2023     2023       2023
##                     name year_length year_min year_max year_range

The table above has 29 rows, one for each GSOC student I have mentored. There are a total of 31 projects; Anirban Chetia and Yufan Fei each did two consecutive years of GSOC.

Note that the warnings above about NAs are false positives, which I have proposed to remove.
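Until then, one way to get the same summary without going through dcast is to compute the aggregates with by, as in the first approach. This is a sketch on a made-up `toy.dt` rather than `projects.dt`; the output column names are chosen to mirror the dcast result.

```r
library(data.table)
# Hypothetical stand-in for projects.dt.
toy.dt <- data.table(
  name=c("Student A", "Student B", "Student A"),
  year=c(2021L, 2021L, 2020L))
range.dt <- toy.dt[, .(
  year_length=.N,
  year_min=min(year),
  year_max=max(year)
), by=name][, year_range := ifelse(
  year_min==year_max,
  as.character(year_min), # single year, converted so both branches are text
  sprintf("%d-%d", year_min, year_max)
)][order(year_range)]
range.dt
```

Here min and max are only ever called on the non-empty group of years for each student.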

Exercise for the reader: add some code above to download the markdown text string of data to parse (projects.string), rather than defining it directly in R code.
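As a starting point for that exercise, here is a minimal sketch. The helper name `read_markdown_string` is hypothetical, and a temporary local file stands in for the web page, because the URL of the markdown source is not given in this post. `readLines` also accepts a URL string, so the same helper should work once the real address is substituted.

```r
# Hypothetical helper (not from the original post): read markdown text
# from a file path or URL and return it as a single string.
read_markdown_string <- function(source){
  lines <- readLines(source, warn=FALSE)
  # Pad with empty strings so the result starts and ends with a newline,
  # which the "\n- \\[" pattern needs to match a bullet on the first line.
  paste(c("", lines, ""), collapse="\n")
}
# Stand-in for the downloaded page: a temporary local markdown file.
local.md <- tempfile(fileext=".md")
writeLines(c(
  "- [Ada Example](https://example.com/ada), 2021, project one.",
  "- [Bob Example](https://example.com/bob), 2022, project two."),
  local.md)
downloaded.string <- read_markdown_string(local.md)
cat(downloaded.string)
```

The resulting string can then be passed to the parsing code in place of `projects.string`.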