Today I checked how many packages I have on the CRAN packages by Maintainer web page (no link provided because the web page is a huge memory hog). I counted by eye and found 14 packages for which I am maintainer. How would we do this programmatically?

The Revolution blog explains one way to do it using a regular expression,

stri_trans_general(str_trim(str_match(
  MAINTAINER, "^[\'\"]?([^\'\",(<]+).*<")[,2]),"latin-ascii"))

As mentioned in that blog, this regex is hard to read, so suffers from the famous “now you have another problem” (code that works but hard to read). Here I explain how you could solve both problems (text parsing and code readability) using regular expressions via my nc package.

Direct translation into nc code

We begin with a direct translation of the other blog code into nc code. First we download the CRAN package database and parse the Maintainer field using the same code as the other blog:

> pkglist <- tools::CRAN_package_db()
> subject <- pkglist[["Maintainer"]]
> stringr.maint <- stringr::str_match(
+   subject, "^[\'\"]?([^\'\",(<]+).*<")[,2]

A direct translation of the regex string literal pattern above into a more readable nc pattern would be:

> maybe.quote <- "[\'\"]?"
> not.special <- "[^\'\",(<]+"
> until.email <- ".*<"
> nc.pattern <- list(
+   "^",
+   maybe.quote,
+   maint=not.special, 
+   until.email)

Using that pattern with nc and the ICU engine (same as stringr::str_match) gives

> nc::capture_first_vec(
+   subject,
+   nc.pattern,
+   engine="ICU")
Error in stop.fun(missing.match) : 
  subjects 7574,9068,12641,12972 did not match regex below; to output missing rows use nomatch.error=FALSE
(?:(?:^['"]?([^'",(<]+).*<))

By default nc::capture_first_vec stops with an error if any subjects do not match the provided regex. To see what those subjects are we can do

> subject[c(7574,9068,12641,12972)]
[1] "ORPHANED" "ORPHANED" "ORPHANED" "ORPHANED"

So ORPHANED is a special keyword that does not match the pattern discussed in the other blog. To go ahead with the matching and use a missing value NA for the result for these ORPHANED subjects we can do

> nc.dt <- nc::capture_first_vec(
+   subject,
+   nc.pattern,
+   engine="ICU",
+   nomatch.error=FALSE)
> identical(nc.dt[["maint"]], stringr.maint)
[1] TRUE

The result from the nc code above is identical to the result from stringr, but much more readable. Actually, we could do something similar for improved readability using stringr,

> stringr.pattern <- paste0(
+   maybe.quote,
+   "(", not.special, ")",
+   until.email)
> stringr.maint.pasted <- stringr::str_match(
+   subject, stringr.pattern)[,2]
> identical(stringr.maint.pasted, stringr.maint)
[1] TRUE

Parsing the entire Maintainer field with nc

The code above does NOT parse the entire Maintainer field, i.e.

> stringr.dt <- data.table(subject, stringr.maint)
> stringr.dt[1:2]
                                           subject           stringr.maint
1:       Scott Fortmann-Roe <scottfr@berkeley.edu>     Scott Fortmann-Roe 
2: Raja Sekhara Reddy D.M <raja.duvvuru@gmail.com> Raja Sekhara Reddy D.M 
> stringr.dt[grepl("[(]", subject)][1:2]
                                      subject stringr.maint
1:        Jie (Kate) Hu <hujie0704@gmail.com>          Jie 
2: Yifan (Ethan) Xu <ethan.yifanxu@gmail.com>        Yifan 

The first two examples show that the email is not parsed, and the second two show that the nickname and last name are not always parsed. How can we do that with nc?

First let’s try parsing the full name and email with the pattern below,

> email.pattern <- list(
+   " <",
+   email=".*",
+   ">")
> nc::capture_first_vec(
+   subject,
+   "^",
+   name=".*",
+   email.pattern,
+   "$")
Error in stop_for_na(make.na) : 
  subjects 101,113,461,768,788,888,1167,1519,1666,1739,1777,1927,1928,1948,2304,2396,2517,2630,3370,3533,3571,3633,3638,3660,4054,4063,4301,4548,4570,5002,5063,5156,5159,5178,5183,5208,5241,5435,5639,5836,5847,6072,6121,6386,6481,6602,6992,7034,7574,7913,7963,7967,7988,8077,8320,8781,8782,8916,9038,9068,9178,9300,9368,9369,9499,9929,10216,10225,10636,11063,11076,11146,11147,11487,11506,11845,11916,11917,12151,12223,12317,12319,12412,12424,12429,12641,12688,12958,12972,12978,13087,13200,13349,13563,13700,13842,13952,14140,14383,14615,14617,14618,14643,15363,15464,15749,15829,16026,16041,16243 did not match regex below; to output missing rows use nomatch.error=FALSE
(?:^(.*)(?:(?: <(.*))>)$)
> 

That did not work. What subjects did not match?

> subject[c(101,113,461,768)]
[1] "Marc Vandemeulebroecke\n<marc.vandemeulebroecke@novartis.com>" 
[2] "Bert Gunter<bgunter.4567@gmail.com>"                           
[3] "Nicolas Frerebeau\n<nicolas.frerebeau@u-bordeaux-montaigne.fr>"
[4] "Sherry Zhao<sxzhao@gwu.edu>"                                   
> 

So there are some subjects which do not have a space between the name and the angled bracket which begins the email. To match that we can do

> maybe.email <- nc::quantifier(
+   "\\s*<",
+   email=".*",
+   ">",
+   "?")
> email.dt <- nc::capture_first_vec(
+   subject,
+   "^",
+   name=".*?",
+   maybe.email,
+   "$")

There was no error, which means everything matched the provided regex. The match data table has one row per subject and one column per capture group (named arguments specified in either the call to nc::capture_first_vec or in one of the sub-patterns),

> email.dt
                         name                       email
    1:     Scott Fortmann-Roe        scottfr@berkeley.edu
    2: Raja Sekhara Reddy D.M      raja.duvvuru@gmail.com
    3:         Sercan Kahveci    sercan.kahveci@sbg.ac.at
    4:             Mintu Nath         dr.m.nath@gmail.com
    5:            Gaurav Sood           gsood07@gmail.com
   ---                                                   
16293:       Michael Lawrence           michafla@gene.com
16294:    Thomas Lin Pedersen thomas.pedersen@rstudio.com
16295:           Lionel Henry          lionel@rstudio.com
16296:           Brian Ripley       ripley@stats.ox.ac.uk
16297:              CRAN Team          CRAN@r-project.org

Since we defined email as an optional group above using nc::quantifier, we have some rows with empty email strings,

> email.dt[email==""]
       name email
1: ORPHANED      
2: ORPHANED      
3: ORPHANED      
4: ORPHANED      

Now how could we parse the text inside the parentheses? First of all, how many parentheses can there be?

> table(gsub("[^(]", "", subject), gsub("[^)]", "", subject))
   
              )
    16246     0
  (     0    51

So there are either no parens (16246 subjects) or one pair of parens (51 subjects). Looking at those subjects,

> grep("[(]", email.dt[["name"]], value=TRUE)[1:5]
[1] "Jie (Kate) Hu"         "Yifan (Ethan) Xu"      "Bao Sheng Loe (Aiden)"
[4] "Zhiyuan (Jason) Xu"    "Kuan-Yu (Alex) Chen"  

The parens may come in the middle or at the end. Here is a pattern for that:

> maybe.nickname <- nc::quantifier(
+   "[(]",
+   nickname=".*",
+   "[)]",
+   "?")
> nick.dt <- nc::capture_first_df(email.dt, name=list(
+   "^",
+   maybe.quote,
+   before='[^("\']*',
+   maybe.quote,
+   maybe.nickname,
+   after=".*?",
+   maybe.quote,
+   "$"))
> nick.dt[nickname!=""][1:5]
                    name                   email         before nickname after
1:         Jie (Kate) Hu     hujie0704@gmail.com           Jie      Kate    Hu
2:      Yifan (Ethan) Xu ethan.yifanxu@gmail.com         Yifan     Ethan    Xu
3: Bao Sheng Loe (Aiden)         bsl28@cam.ac.uk Bao Sheng Loe     Aiden      
4:    Zhiyuan (Jason) Xu        xuxx0284@umn.edu       Zhiyuan     Jason    Xu
5:   Kuan-Yu (Alex) Chen    alexkychen@gmail.com       Kuan-Yu      Alex  Chen

We can then recombine the before and after into a full name via

> nick.dt[, full := sub(" *$", "", gsub(" +", " ", paste(before, after)))]
> nick.dt[nickname!=""][1:5, .(before, after, full)]
           before after          full
1:           Jie     Hu        Jie Hu
2:         Yifan     Xu      Yifan Xu
3: Bao Sheng Loe        Bao Sheng Loe
4:       Zhiyuan     Xu    Zhiyuan Xu
5:       Kuan-Yu   Chen  Kuan-Yu Chen

Finally to query my packages,

> my.i <- nick.dt[, which(full=="Toby Dylan Hocking")]
> data.table(pkglist)[my.i, .(Package, Maintainer, Version)]
            Package                                      Maintainer    Version
 1:      binsegRcpp Toby Dylan Hocking <toby.hocking@r-project.org>   2020.9.3
 2:    directlabels Toby Dylan Hocking <toby.hocking@r-project.org>  2020.6.17
 3:      inlinedocs Toby Dylan Hocking <toby.hocking@r-project.org>  2019.12.5
 4:          LOPART Toby Dylan Hocking <toby.hocking@r-project.org>  2020.6.29
 5:    namedCapture Toby Dylan Hocking <toby.hocking@r-project.org>   2020.4.1
 6:              nc Toby Dylan Hocking <toby.hocking@r-project.org>   2020.8.6
 7:   neuroblastoma    Toby Dylan Hocking <toby@sg.cs.titech.ac.jp>        1.0
 8:       PeakError Toby Dylan Hocking <toby.hocking@r-project.org> 2017.06.19
 9:     PeakSegDisk Toby Dylan Hocking <toby.hocking@r-project.org>  2020.8.13
10:       PeakSegDP Toby Dylan Hocking <toby.hocking@r-project.org> 2017.08.15
11:    PeakSegJoint Toby Dylan Hocking <toby.hocking@r-project.org>  2018.10.3
12:  PeakSegOptimal Toby Dylan Hocking <toby.hocking@r-project.org> 2018.05.25
13: penaltyLearning Toby Dylan Hocking <toby.hocking@r-project.org>  2020.5.13
14:     WeightedROC Toby Dylan Hocking <toby.hocking@r-project.org>  2020.1.31

Indeed there seem to be 14 currently. The neuroblastoma package is the oldest one, submitted in 2013 during my postdoc in Japan (Tokyo Tech).

How many other maintainers have more than one email?

> email.name.counts <- nick.dt[, .(count=.N), by=.(email, full.trans)]
> email.name.counts[full.trans=="Toby Dylan Hocking"]
                        email         full.trans count
1: toby.hocking@r-project.org Toby Dylan Hocking    13
2:    toby@sg.cs.titech.ac.jp Toby Dylan Hocking     1
> email.counts <- email.name.counts[, .(emails=.N), by=full.trans]
> email.counts[full.trans=="Toby Dylan Hocking"]
           full.trans emails
1: Toby Dylan Hocking      2
> table(email.counts$emails)

   1    2    3    4    5 
8178  556   53    7    1 

There are hundreds of maintainers with more than one email. Here are the maintainers with the most emails:

> big.emails <- email.counts[3 < emails]
> email.name.counts[big.emails, on="full.trans"]
                          email      full.trans count emails
 1:   Michel.Ballings@GMail.com Michel Ballings     3      5
 2:    Michel.Ballings@UGent.be Michel Ballings     1      5
 3:   michel.ballings@GMail.com Michel Ballings     2      5
 4:   Michel.Ballings@gmail.com Michel Ballings     1      5
 5:   michel.ballings@gmail.com Michel Ballings     1      5
 6:    xl2473@cumc.columbia.edu        Xiang Li     1      4
 7:        spiritcoke@gmail.com        Xiang Li     1      4
 8:            ynaulx@gmail.com        Xiang Li     1      4
 9:          xli256@its.jnj.com        Xiang Li     1      4
10: simon.urbanek@r-project.org   Simon Urbanek    10      4
11: simon.urbanek@R-project.org   Simon Urbanek     3      4
12: Simon.Urbanek@r-project.org   Simon Urbanek    14      4
13: Simon.Urbanek@R-project.org   Simon Urbanek     1      4
14:        Andrew.Parnell@mu.ie  Andrew Parnell     1      4
15:       andrew.parnell@ucd.ie  Andrew Parnell     1      4
16:       Andrew.Parnell@ucd.ie  Andrew Parnell     1      4
17:        andrew.parnell@mu.ie  Andrew Parnell     1      4
18:   qyzhao@statslab.cam.ac.uk   Qingyuan Zhao     1      4
19:             qz280@cam.ac.uk   Qingyuan Zhao     1      4
20:    qyzhao@wharton.upenn.edu   Qingyuan Zhao     1      4
21:         qingyzhao@gmail.com   Qingyuan Zhao     2      4
22:  john.maindonald@anu.edu.au John Maindonald     1      4
23:      jhmaindonald@gmail.com John Maindonald     1      4
24:  John.Maindonald@anu.edu.au John Maindonald     1      4
25:    john@statsresearch.co.nz John Maindonald     1      4
26:            mloos@envibee.ch     Martin Loos     1      4
27:        Martin.Loos@eawag.ch     Martin Loos     2      4
28:           loosmart@eawag.ch     Martin Loos     1      4
29:      mloos@looscomputing.ch     Martin Loos     1      4
30: wei.jiang@polytechnique.edu       Wei Jiang     1      4
31:     wjiangaa@connect.ust.hk       Wei Jiang     1      4
32:             wjiang@kumc.edu       Wei Jiang     1      4
33:       jiangwei@hrbmu.edu.cn       Wei Jiang     1      4
                          email      full.trans count emails

Exercise for the reader: how to do a similar analysis as above but with emails which are case insensitive? e.g. treat michel.ballings@GMail.com and Michel.Ballings@gmail.com as the same email.

Comparison with original post

Is this treatment the same as what they did in the original post? They did something analogous to the code below:

> stringr.trimmed <- stringr::str_trim(stringr.maint)
> orig.maint <- stringi::stri_trans_general(stringr.trimmed, "latin-ascii")
> (orig.top20 <- data.table(orig.maint)[, .(
+   count=.N
+ ), by=orig.maint][order(-count)][1:20])
             orig.maint count
 1:   Scott Chamberlain    80
 2:   Dirk Eddelbuettel    61
 3:        Gabor Csardi    56
 4:         Jeroen Ooms    47
 5:      Hadley Wickham    46
 6:     Kartikeya Bolar    33
 7:  Robin K. S. Hankin    31
 8:           Bob Rudis    31
 9:    Henrik Bengtsson    30
10:     Martin Maechler    29
11:        Jan Wijffels    29
12:       Simon Urbanek    28
13:         Kurt Hornik    28
14:            Max Kuhn    26
15: Thomas Lin Pedersen    25
16:      John Muschelli    25
17:     Torsten Hothorn    25
18:          Jim Hester    25
19:     Muhammad Yaseen    25
20:           Yihui Xie    24

Is stri_trans_general necessary? Yes, for these 22 people who have more than one way to write their names:

> translation.different <- unique(data.table(
+   stringr.trimmed, orig.maint)[stringr.trimmed != orig.maint])
> translation.also.in.trimmed <- unique(stringr.trimmed[
+   stringr.trimmed %in% translation.different$orig.maint])
> translation.different[orig.maint %in% translation.also.in.trimmed]
              stringr.trimmed                orig.maint
 1:       Aurélie Siberchicot       Aurelie Siberchicot
 2:              Gábor Csárdi              Gabor Csardi
 3:       M. Helena Gonçalves       M. Helena Goncalves
 4: Josué M. Polanco-Martínez Josue M. Polanco-Martinez
 5:           Gergely Daróczi           Gergely Daroczi
 6:             Kauê de Sousa             Kaue de Sousa
 7:            Raphaël Bonnet            Raphael Bonnet
 8:               Joël Gombin               Joel Gombin
 9:   Charles-Édouard Giguère   Charles-Edouard Giguere
10:      Virgilio Gómez-Rubio      Virgilio Gomez-Rubio
11:           Øyvind Langsrud           Oyvind Langsrud
12:        José Cláudio Faria        Jose Claudio Faria
13:              Hervé Perdry              Herve Perdry
14:             Alí Santacruz             Ali Santacruz
15:           Mickaël Canouil           Mickael Canouil
16:           Björn Andersson           Bjorn Andersson
17:              Juraj Szitás              Juraj Szitas
18:    Alejandro Jiménez Rico    Alejandro Jimenez Rico
19:         Przemysław Biecek         Przemyslaw Biecek
20:        Dominik Krzemiński        Dominik Krzeminski
21:             Samuel Macêdo             Samuel Macedo
22: Oscar Perpiñán Lamigueiro Oscar Perpinan Lamigueiro
              stringr.trimmed                orig.maint

Does nc compute the same top 20? YES.

> nick.dt[, full.trans := stringi::stri_trans_general(full, "latin-ascii") ]
> nc.sorted <- nick.dt[, .(
+   count=.N
+ ), by=.(orig.maint=full.trans)][order(-count)]
> nc.top20 <- nc.sorted[1:20]
> identical(nc.top20, orig.top20)
[1] TRUE

But what about the other names? There are 59 which are not the same:

> unique(data.table(
+   nc.maint, orig.maint
+ )[nc.maint != orig.maint | is.na(orig.maint)])
                                              nc.maint              orig.maint
 1:                                             Jie Hu                     Jie
 2:                                           Yifan Xu                   Yifan
 3:                                     Mu Sigma, Inc.                Mu Sigma
 4:                                Eoghan T O Halloran              Eoghan T O
 5:                                         Zhiyuan Xu                 Zhiyuan
 6:                                       Kuan-Yu Chen                 Kuan-Yu
 7:                                 Eric D., Feigelson                 Eric D.
 8:                                   Alan O Callaghan                  Alan O
 9:                                           Yang Liu                    Yang
10:                                Alexander Pastukhov               Alexander
11:                                       Sy Han Chiou                  Sy Han
12:                                             Xi LUO                      Xi
13:                                       Ting Fung Ma               Ting Fung
14:                                             Xi Luo                      Xi
15:                                     Mark O Connell                  Mark O
16:                                 Antonio D Ambrosio               Antonio D
17:                                   Kropko, Jonathan                  Kropko
18:                                       Yanwei Zhang                  Yanwei
19:                                    Simon, Reinhard                   Simon
20:                                        Xiaorui Zhu                 Xiaorui
21:                             Garcia-Rodenas, Alvaro          Garcia-Rodenas
22:                                  Brian P. O Connor              Brian P. O
23:                               Mitchell O Hara-Wild              Mitchell O
24:                              Pedro H. C. Sant Anna        Pedro H. C. Sant
25:                                      Seunggeun Lee               Seunggeun
26:                                   Hsiang Hao, Chen              Hsiang Hao
27:                                     Joshua O Brien                Joshua O
28:                                       Xiaofei Wang                 Xiaofei
29:                                        Xuanhua Yin                 Xuanhua
30:                                           Hui Tang                     Hui
31:                                    Michael O Neill               Michael O
32: Markus Loecher, Berlin School of Economics and Law          Markus Loecher
33:                                              Na Li                      Na
34:             Antoine Tremblay, Dalhousie University        Antoine Tremblay
35:                                           ORPHANED                    <NA>
36:                                  Christoph Hoeppke      Christoph  Hoeppke
37:                                    Jared O Connell                 Jared O
38:                                       Tao Liu, PhD                 Tao Liu
39:                                     Eoin O Connell                  Eoin O
40:                                     Lauren O Brien                Lauren O
41:                                    Silvia D Angelo                Silvia D
42:                                          Suhai Liu                   Suhai
43:                                     Ming-Chang Lee              Ming-Chang
44:                                    Patrick O Keefe               Patrick O
45:                                     Nezami,Hossein                  Nezami
46:                                Varhegyi, Nikole E.                Varhegyi
47:                                          Jose Gama                    Jose
48:                                    Tim Triche, Jr.              Tim Triche
49:                                       Xiuquan Wang                 Xiuquan
50:                                    Shawn T. O Neil              Shawn T. O
51:                               Matteo Dell Omodarme             Matteo Dell
52:                                   Lorenzo D Andrea               Lorenzo D
53:                                        Eamon O Dea                 Eamon O
54:                           Nepomechie, David Israel              Nepomechie
55:                                  Marcello D Orazio              Marcello D
56:                            Lucy D Agostino McGowan                  Lucy D
57:                                James P. Howard, II         James P. Howard
58:                             Robert Myles McDonnell Robert Myles  McDonnell
59:                                           Zekun Xu                   Zekun
                                              nc.maint              orig.maint

Some issues to fix (exercise for the reader)

  • nc produced Nezami,Hossein which would needs a space.
  • orig method produced Robert Myles McDonnell which has an extra space.
  • nc produced both Xi LUO and Xi Luo which are the same person but different text strings.
  • nc produces ORPHANED whereas the original method produced missing values, NA. Not to be confused with the name Na (orig) or Na Li (nc).

Where do I appear in the ranking? In the top 50, out of almost 9000 package maintainers! That is the top 1%.

> nc.sorted[, Rank := rank(-count)]
> (N.maintainers <- nrow(nc.sorted))
[1] 8795
> nc.sorted[, percent := 100*Rank/N.maintainers]
> nc.sorted[orig.maint=="Toby Dylan Hocking"]
           orig.maint count rank Rank   percent
1: Toby Dylan Hocking    14   50   50 0.5685048

Finally how many packages do you need to get into the top x%?

> unique(nc.sorted[, .(count, Rank, percent)])
    count   Rank     percent
 1:    80    1.0  0.01137010
 2:    61    2.0  0.02274019
 3:    56    3.0  0.03411029
 4:    47    4.0  0.04548039
 5:    46    5.0  0.05685048
 6:    33    6.0  0.06822058
 7:    31    7.5  0.08527572
 8:    30    9.5  0.10801592
 9:    29   11.0  0.12507106
10:    28   12.5  0.14212621
11:    26   14.0  0.15918135
12:    25   17.0  0.19329164
13:    24   21.0  0.23877203
14:    23   23.0  0.26151222
15:    22   24.5  0.27856737
16:    21   26.0  0.29562251
17:    20   27.5  0.31267766
18:    19   30.5  0.34678795
19:    18   35.5  0.40363843
20:    16   40.0  0.45480387
21:    15   44.0  0.50028425
22:    14   50.0  0.56850483
23:    13   58.5  0.66515065
24:    12   69.5  0.79022172
25:    11   87.5  0.99488346
26:    10  111.5  1.26776578
27:     9  137.5  1.56338829
28:     8  177.0  2.01250711
29:     7  237.5  2.70039795
30:     6  319.0  3.62706083
31:     5  433.0  4.92325185
32:     4  646.0  7.34508243
33:     3 1086.0 12.34792496
34:     2 2137.0 24.29789653
35:     1 5844.5 66.45252985
    count   Rank     percent

So you need 11 packages to be in the top 1%, 4 packages to be in the top 10%, etc.