Parsing CRAN maintainers
Today I checked how many packages I have on the CRAN packages by Maintainer web page (no link provided because the web page is a huge memory hog). I counted by eye and found 14 packages for which I am maintainer. How would we do this programmatically?
The Revolution blog explains one way to do it using a regular expression,
stri_trans_general(str_trim(str_match(
MAINTAINER, "^[\'\"]?([^\'\",(<]+).*<")[,2]),"latin-ascii"))
As mentioned in that blog, this regex is hard to read, so it suffers from the famous “now you have another problem” issue: the code works, but it is hard to read. Here I explain how you could solve both problems (text parsing and code readability) using regular expressions via my nc package.
Direct translation into nc code
We begin with a direct translation of the other blog code into nc code. First we download the CRAN package database and parse the Maintainer field using the same code as the other blog:
> pkglist <- tools::CRAN_package_db()
> subject <- pkglist[["Maintainer"]]
> stringr.maint <- stringr::str_match(
+ subject, "^[\'\"]?([^\'\",(<]+).*<")[,2]
A direct translation of the regex string literal pattern above into a more readable nc pattern would be:
> maybe.quote <- "[\'\"]?"
> not.special <- "[^\'\",(<]+"
> until.email <- ".*<"
> nc.pattern <- list(
+ "^",
+ maybe.quote,
+ maint=not.special,
+ until.email)
Using that pattern with nc and the ICU engine (same as stringr::str_match) gives
> nc::capture_first_vec(
+ subject,
+ nc.pattern,
+ engine="ICU")
Error in stop.fun(missing.match) :
subjects 7574,9068,12641,12972 did not match regex below; to output missing rows use nomatch.error=FALSE
(?:(?:^['"]?([^'",(<]+).*<))
By default nc::capture_first_vec stops with an error if any subjects do not match the provided regex. To see what those subjects are we can do
> subject[c(7574,9068,12641,12972)]
[1] "ORPHANED" "ORPHANED" "ORPHANED" "ORPHANED"
So ORPHANED is a special keyword that does not match the pattern discussed in the other blog. To go ahead with the matching and use a missing value NA as the result for these ORPHANED subjects we can do
> nc.dt <- nc::capture_first_vec(
+ subject,
+ nc.pattern,
+ engine="ICU",
+ nomatch.error=FALSE)
> identical(nc.dt[["maint"]], stringr.maint)
[1] TRUE
The result from the nc code above is identical to the result from stringr, but the nc code is much more readable. Actually, we could get similar readability with stringr by pasting the same sub-patterns together,
> stringr.pattern <- paste0(
+ "^",
+ maybe.quote,
+ "(", not.special, ")",
+ until.email)
> stringr.maint.pasted <- stringr::str_match(
+ subject, stringr.pattern)[,2]
> identical(stringr.maint.pasted, stringr.maint)
[1] TRUE
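For readers without nc or stringr installed, the same idea of building a pattern from named pieces can be sketched in base R. This is only a minimal sketch, not the method of either blog post: it assumes base R's regexec and regmatches with perl=TRUE, and the three subjects are a small illustrative sample rather than the full CRAN database.

```r
# Sub-patterns, named as in the text above.
maybe.quote <- "['\"]?"
not.special <- "[^'\",(<]+"
until.email <- ".*<"
# Compose the pieces; the parentheses mark the capture group for the name.
base.pattern <- paste0(
  "^", maybe.quote, "(", not.special, ")", until.email)
# Small illustrative sample (an assumption, not the real database).
subject <- c(
  "Scott Fortmann-Roe <scottfr@berkeley.edu>",
  "Jie (Kate) Hu <hujie0704@gmail.com>",
  "ORPHANED")
match.list <- regmatches(
  subject, regexec(base.pattern, subject, perl=TRUE))
# Element 1 is the whole match and element 2 is the first capture group.
# Subjects that do not match give a zero-length result, mapped to NA here.
base.maint <- vapply(match.list, function(m) {
  if (length(m) >= 2) trimws(m[2]) else NA_character_
}, character(1))
print(base.maint)
```

As with the stringr version, a subject with no angle bracket (ORPHANED) yields a missing value, and a name containing a parenthesis is cut off at the parenthesis.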
Parsing the entire Maintainer field with nc
The code above does NOT parse the entire Maintainer field, i.e.
> stringr.dt <- data.table(subject, stringr.maint)
> stringr.dt[1:2]
subject stringr.maint
1: Scott Fortmann-Roe <scottfr@berkeley.edu> Scott Fortmann-Roe
2: Raja Sekhara Reddy D.M <raja.duvvuru@gmail.com> Raja Sekhara Reddy D.M
> stringr.dt[grepl("[(]", subject)][1:2]
subject stringr.maint
1: Jie (Kate) Hu <hujie0704@gmail.com> Jie
2: Yifan (Ethan) Xu <ethan.yifanxu@gmail.com> Yifan
The first two examples show that the email is not parsed, and the second two show that the nickname and last name are not always parsed. How can we do that with nc?
First let’s try parsing the full name and email with the pattern below,
> email.pattern <- list(
+ " <",
+ email=".*",
+ ">")
> nc::capture_first_vec(
+ subject,
+ "^",
+ name=".*",
+ email.pattern,
+ "$")
Error in stop_for_na(make.na) :
subjects 101,113,461,768,788,888,1167,1519,1666,1739,1777,1927,1928,1948,2304,2396,2517,2630,3370,3533,3571,3633,3638,3660,4054,4063,4301,4548,4570,5002,5063,5156,5159,5178,5183,5208,5241,5435,5639,5836,5847,6072,6121,6386,6481,6602,6992,7034,7574,7913,7963,7967,7988,8077,8320,8781,8782,8916,9038,9068,9178,9300,9368,9369,9499,9929,10216,10225,10636,11063,11076,11146,11147,11487,11506,11845,11916,11917,12151,12223,12317,12319,12412,12424,12429,12641,12688,12958,12972,12978,13087,13200,13349,13563,13700,13842,13952,14140,14383,14615,14617,14618,14643,15363,15464,15749,15829,16026,16041,16243 did not match regex below; to output missing rows use nomatch.error=FALSE
(?:^(.*)(?:(?: <(.*))>)$)
That did not work. What subjects did not match?
> subject[c(101,113,461,768)]
[1] "Marc Vandemeulebroecke\n<marc.vandemeulebroecke@novartis.com>"
[2] "Bert Gunter<bgunter.4567@gmail.com>"
[3] "Nicolas Frerebeau\n<nicolas.frerebeau@u-bordeaux-montaigne.fr>"
[4] "Sherry Zhao<sxzhao@gwu.edu>"
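So some subjects have a newline instead of a space between the name and the angle bracket which begins the email, and others have no whitespace at all. To match both cases we can allow zero or more whitespace characters before the bracket. We also make the email optional (so that ORPHANED subjects match) and the name non-greedy (so that it stops before the optional email):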
> maybe.email <- nc::quantifier(
+ "\\s*<",
+ email=".*",
+ ">",
+ "?")
> email.dt <- nc::capture_first_vec(
+ subject,
+ "^",
+ name=".*?",
+ maybe.email,
+ "$")
There was no error, which means everything matched the provided regex. The match data table has one row per subject and one column per capture group (named arguments specified in either the call to nc::capture_first_vec or in one of the sub-patterns),
> email.dt
name email
1: Scott Fortmann-Roe scottfr@berkeley.edu
2: Raja Sekhara Reddy D.M raja.duvvuru@gmail.com
3: Sercan Kahveci sercan.kahveci@sbg.ac.at
4: Mintu Nath dr.m.nath@gmail.com
5: Gaurav Sood gsood07@gmail.com
---
16293: Michael Lawrence michafla@gene.com
16294: Thomas Lin Pedersen thomas.pedersen@rstudio.com
16295: Lionel Henry lionel@rstudio.com
16296: Brian Ripley ripley@stats.ox.ac.uk
16297: CRAN Team CRAN@r-project.org
Since we defined email as an optional group above using nc::quantifier, we have some rows with empty email strings,
> email.dt[email==""]
name email
1: ORPHANED
2: ORPHANED
3: ORPHANED
4: ORPHANED
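The interaction between the non-greedy name and the optional email group can be checked on a few subjects with base R. This is a minimal sketch under stated assumptions: it uses sub with perl=TRUE rather than nc, the pattern string full.pattern is my own hand-written equivalent of the nc pattern above, and the four subjects are a small sample taken from the outputs in this post.

```r
subject <- c(
  "Scott Fortmann-Roe <scottfr@berkeley.edu>",
  "Marc Vandemeulebroecke\n<marc.vandemeulebroecke@novartis.com>",
  "Bert Gunter<bgunter.4567@gmail.com>",
  "ORPHANED")
# Non-greedy name, then an optional non-capturing group which contains
# optional whitespace, the angle brackets, and the email capture group.
full.pattern <- "^(.*?)(?:\\s*<(.*)>)?$"
parsed <- data.frame(
  name=sub(full.pattern, "\\1", subject, perl=TRUE),
  email=sub(full.pattern, "\\2", subject, perl=TRUE),
  stringsAsFactors=FALSE)
print(parsed)
```

If the name were greedy instead, it would consume the whole subject and the optional email group would always be empty, which is why the non-greedy quantifier matters here.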
Now how could we parse the text inside the parentheses? First of all, how many parentheses can there be?
> table(gsub("[^(]", "", subject), gsub("[^)]", "", subject))
          )
    16246  0
  (     0 51
So there are either no parens (16246 subjects) or one pair of parens (51 subjects). Looking at those subjects,
> grep("[(]", email.dt[["name"]], value=TRUE)[1:5]
[1] "Jie (Kate) Hu" "Yifan (Ethan) Xu" "Bao Sheng Loe (Aiden)"
[4] "Zhiyuan (Jason) Xu" "Kuan-Yu (Alex) Chen"
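The counting idiom used in the table above (delete every character except the one of interest, then see what is left) can be made explicit with nchar. The sketch below is illustrative only: the three subjects are a small sample, and the variable names n.open and n.close are my own.

```r
subject <- c(
  "Scott Fortmann-Roe <scottfr@berkeley.edu>",
  "Jie (Kate) Hu <hujie0704@gmail.com>",
  "Bao Sheng Loe (Aiden) <bsl28@cam.ac.uk>")
# Remove every character that is not an open paren, then count the rest.
n.open <- nchar(gsub("[^(]", "", subject))
# Same idea for close parens.
n.close <- nchar(gsub("[^)]", "", subject))
# Cross-tabulate the two counts, one cell per combination.
print(table(open=n.open, close=n.close))
```

If every subject falls on the diagonal of that table, then the parentheses are balanced in number, which is what justifies a pattern with at most one optional nickname group.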
The parens may come in the middle or at the end. Here is a pattern for that:
> maybe.nickname <- nc::quantifier(
+ "[(]",
+ nickname=".*",
+ "[)]",
+ "?")
> nick.dt <- nc::capture_first_df(email.dt, name=list(
+ "^",
+ maybe.quote,
+ before='[^("\']*',
+ maybe.quote,
+ maybe.nickname,
+ after=".*?",
+ maybe.quote,
+ "$"))
> nick.dt[nickname!=""][1:5]
                    name                   email        before nickname after
1:         Jie (Kate) Hu     hujie0704@gmail.com           Jie     Kate    Hu
2:      Yifan (Ethan) Xu ethan.yifanxu@gmail.com         Yifan    Ethan    Xu
3: Bao Sheng Loe (Aiden)         bsl28@cam.ac.uk Bao Sheng Loe    Aiden
4:    Zhiyuan (Jason) Xu        xuxx0284@umn.edu       Zhiyuan    Jason    Xu
5:   Kuan-Yu (Alex) Chen    alexkychen@gmail.com       Kuan-Yu     Alex  Chen
We can then recombine the before and after into a full name via
> nick.dt[, full := sub(" *$", "", gsub(" +", " ", paste(before, after)))]
> nick.dt[nickname!=""][1:5, .(before, after, full)]
          before after          full
1:           Jie    Hu        Jie Hu
2:         Yifan    Xu      Yifan Xu
3: Bao Sheng Loe       Bao Sheng Loe
4:       Zhiyuan    Xu    Zhiyuan Xu
5:       Kuan-Yu  Chen  Kuan-Yu Chen
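To see why the recombination needs both a gsub and a sub, here is the same computation done one step at a time in base R. The inputs are an assumption for illustration: I assume that before keeps the trailing space and after keeps the leading space around the parentheses, as the patterns above suggest.

```r
# Assumed capture results for "Jie (Kate) Hu" and "Bao Sheng Loe (Aiden)".
before <- c("Jie ", "Bao Sheng Loe ")
after <- c(" Hu", "")
# paste adds one more space, so there are runs of several spaces.
pasted <- paste(before, after)
# Collapse each run of spaces into a single space.
squeezed <- gsub(" +", " ", pasted)
# Remove the trailing space left over when after is empty.
full <- sub(" *$", "", squeezed)
print(full)
```

The gsub handles a nickname in the middle of the name, and the sub handles a nickname at the end.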
Finally to query my packages,
> my.i <- nick.dt[, which(full=="Toby Dylan Hocking")]
> data.table(pkglist)[my.i, .(Package, Maintainer, Version)]
Package Maintainer Version
1: binsegRcpp Toby Dylan Hocking <toby.hocking@r-project.org> 2020.9.3
2: directlabels Toby Dylan Hocking <toby.hocking@r-project.org> 2020.6.17
3: inlinedocs Toby Dylan Hocking <toby.hocking@r-project.org> 2019.12.5
4: LOPART Toby Dylan Hocking <toby.hocking@r-project.org> 2020.6.29
5: namedCapture Toby Dylan Hocking <toby.hocking@r-project.org> 2020.4.1
6: nc Toby Dylan Hocking <toby.hocking@r-project.org> 2020.8.6
7: neuroblastoma Toby Dylan Hocking <toby@sg.cs.titech.ac.jp> 1.0
8: PeakError Toby Dylan Hocking <toby.hocking@r-project.org> 2017.06.19
9: PeakSegDisk Toby Dylan Hocking <toby.hocking@r-project.org> 2020.8.13
10: PeakSegDP Toby Dylan Hocking <toby.hocking@r-project.org> 2017.08.15
11: PeakSegJoint Toby Dylan Hocking <toby.hocking@r-project.org> 2018.10.3
12: PeakSegOptimal Toby Dylan Hocking <toby.hocking@r-project.org> 2018.05.25
13: penaltyLearning Toby Dylan Hocking <toby.hocking@r-project.org> 2020.5.13
14: WeightedROC Toby Dylan Hocking <toby.hocking@r-project.org> 2020.1.31
Indeed there seem to be 14 currently. The neuroblastoma package is the oldest one, submitted in 2013 during my postdoc in Japan (Tokyo Tech).
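How many other maintainers have more than one email? So that names written with and without accents are counted as the same maintainer, we first compute full.trans, the ASCII transliteration of the full name:
> nick.dt[, full.trans := stringi::stri_trans_general(full, "latin-ascii")]
> email.name.counts <- nick.dt[, .(count=.N), by=.(email, full.trans)]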
> email.name.counts[full.trans=="Toby Dylan Hocking"]
email full.trans count
1: toby.hocking@r-project.org Toby Dylan Hocking 13
2: toby@sg.cs.titech.ac.jp Toby Dylan Hocking 1
> email.counts <- email.name.counts[, .(emails=.N), by=full.trans]
> email.counts[full.trans=="Toby Dylan Hocking"]
full.trans emails
1: Toby Dylan Hocking 2
> table(email.counts$emails)
   1    2    3    4    5
8178  556   53    7    1
There are hundreds of maintainers with more than one email. Here are the maintainers with the most emails:
> big.emails <- email.counts[3 < emails]
> email.name.counts[big.emails, on="full.trans"]
email full.trans count emails
1: Michel.Ballings@GMail.com Michel Ballings 3 5
2: Michel.Ballings@UGent.be Michel Ballings 1 5
3: michel.ballings@GMail.com Michel Ballings 2 5
4: Michel.Ballings@gmail.com Michel Ballings 1 5
5: michel.ballings@gmail.com Michel Ballings 1 5
6: xl2473@cumc.columbia.edu Xiang Li 1 4
7: spiritcoke@gmail.com Xiang Li 1 4
8: ynaulx@gmail.com Xiang Li 1 4
9: xli256@its.jnj.com Xiang Li 1 4
10: simon.urbanek@r-project.org Simon Urbanek 10 4
11: simon.urbanek@R-project.org Simon Urbanek 3 4
12: Simon.Urbanek@r-project.org Simon Urbanek 14 4
13: Simon.Urbanek@R-project.org Simon Urbanek 1 4
14: Andrew.Parnell@mu.ie Andrew Parnell 1 4
15: andrew.parnell@ucd.ie Andrew Parnell 1 4
16: Andrew.Parnell@ucd.ie Andrew Parnell 1 4
17: andrew.parnell@mu.ie Andrew Parnell 1 4
18: qyzhao@statslab.cam.ac.uk Qingyuan Zhao 1 4
19: qz280@cam.ac.uk Qingyuan Zhao 1 4
20: qyzhao@wharton.upenn.edu Qingyuan Zhao 1 4
21: qingyzhao@gmail.com Qingyuan Zhao 2 4
22: john.maindonald@anu.edu.au John Maindonald 1 4
23: jhmaindonald@gmail.com John Maindonald 1 4
24: John.Maindonald@anu.edu.au John Maindonald 1 4
25: john@statsresearch.co.nz John Maindonald 1 4
26: mloos@envibee.ch Martin Loos 1 4
27: Martin.Loos@eawag.ch Martin Loos 2 4
28: loosmart@eawag.ch Martin Loos 1 4
29: mloos@looscomputing.ch Martin Loos 1 4
30: wei.jiang@polytechnique.edu Wei Jiang 1 4
31: wjiangaa@connect.ust.hk Wei Jiang 1 4
32: wjiang@kumc.edu Wei Jiang 1 4
33: jiangwei@hrbmu.edu.cn Wei Jiang 1 4
email full.trans count emails
Exercise for the reader: how would you do a similar analysis, but treating emails as case-insensitive? For example, treat michel.ballings@GMail.com and Michel.Ballings@gmail.com as the same email.
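One possible approach to that exercise is to convert each email to lower case before counting distinct values. The base R sketch below is only illustrative: it assumes that case carries no meaning in these addresses, it uses a hand-copied sample of the emails shown above instead of the full nick.dt table, and the column name email.lower is my own.

```r
emails <- data.frame(
  full.trans=c(rep("Michel Ballings", 5), rep("Simon Urbanek", 4)),
  email=c(
    "Michel.Ballings@GMail.com", "Michel.Ballings@UGent.be",
    "michel.ballings@GMail.com", "Michel.Ballings@gmail.com",
    "michel.ballings@gmail.com",
    "simon.urbanek@r-project.org", "simon.urbanek@R-project.org",
    "Simon.Urbanek@r-project.org", "Simon.Urbanek@R-project.org"),
  stringsAsFactors=FALSE)
# Normalize case so that addresses differing only in case are equal.
emails$email.lower <- tolower(emails$email)
# Keep one row per distinct (name, lower case email) pair.
distinct <- unique(emails[, c("full.trans", "email.lower")])
# Count the distinct emails for each maintainer.
email.counts <- table(distinct$full.trans)
print(email.counts)
```

With the data table from this post, the same normalization could presumably be done by grouping on tolower(email) instead of email.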
Comparison with original post
Is this treatment the same as what they did in the original post? They did something analogous to the code below:
> stringr.trimmed <- stringr::str_trim(stringr.maint)
> orig.maint <- stringi::stri_trans_general(stringr.trimmed, "latin-ascii")
> (orig.top20 <- data.table(orig.maint)[, .(
+ count=.N
+ ), by=orig.maint][order(-count)][1:20])
orig.maint count
1: Scott Chamberlain 80
2: Dirk Eddelbuettel 61
3: Gabor Csardi 56
4: Jeroen Ooms 47
5: Hadley Wickham 46
6: Kartikeya Bolar 33
7: Robin K. S. Hankin 31
8: Bob Rudis 31
9: Henrik Bengtsson 30
10: Martin Maechler 29
11: Jan Wijffels 29
12: Simon Urbanek 28
13: Kurt Hornik 28
14: Max Kuhn 26
15: Thomas Lin Pedersen 25
16: John Muschelli 25
17: Torsten Hothorn 25
18: Jim Hester 25
19: Muhammad Yaseen 25
20: Yihui Xie 24
Is stri_trans_general necessary? Yes, for these 22 people who have more than one way to write their names:
> translation.different <- unique(data.table(
+ stringr.trimmed, orig.maint)[stringr.trimmed != orig.maint])
> translation.also.in.trimmed <- unique(stringr.trimmed[
+ stringr.trimmed %in% translation.different$orig.maint])
> translation.different[orig.maint %in% translation.also.in.trimmed]
stringr.trimmed orig.maint
1: Aurélie Siberchicot Aurelie Siberchicot
2: Gábor Csárdi Gabor Csardi
3: M. Helena Gonçalves M. Helena Goncalves
4: Josué M. Polanco-Martínez Josue M. Polanco-Martinez
5: Gergely Daróczi Gergely Daroczi
6: Kauê de Sousa Kaue de Sousa
7: Raphaël Bonnet Raphael Bonnet
8: Joël Gombin Joel Gombin
9: Charles-Édouard Giguère Charles-Edouard Giguere
10: Virgilio Gómez-Rubio Virgilio Gomez-Rubio
11: Øyvind Langsrud Oyvind Langsrud
12: José Cláudio Faria Jose Claudio Faria
13: Hervé Perdry Herve Perdry
14: Alí Santacruz Ali Santacruz
15: Mickaël Canouil Mickael Canouil
16: Björn Andersson Bjorn Andersson
17: Juraj Szitás Juraj Szitas
18: Alejandro Jiménez Rico Alejandro Jimenez Rico
19: Przemysław Biecek Przemyslaw Biecek
20: Dominik Krzemiński Dominik Krzeminski
21: Samuel Macêdo Samuel Macedo
22: Oscar Perpiñán Lamigueiro Oscar Perpinan Lamigueiro
stringr.trimmed orig.maint
Does nc compute the same top 20? YES.
> nick.dt[, full.trans := stringi::stri_trans_general(full, "latin-ascii") ]
> nc.sorted <- nick.dt[, .(
+ count=.N
+ ), by=.(orig.maint=full.trans)][order(-count)]
> nc.top20 <- nc.sorted[1:20]
> identical(nc.top20, orig.top20)
[1] TRUE
But what about the other names? There are 59 which are not the same:
> nc.maint <- nick.dt[["full.trans"]]
> unique(data.table(
+ nc.maint, orig.maint
+ )[nc.maint != orig.maint | is.na(orig.maint)])
nc.maint orig.maint
1: Jie Hu Jie
2: Yifan Xu Yifan
3: Mu Sigma, Inc. Mu Sigma
4: Eoghan T O Halloran Eoghan T O
5: Zhiyuan Xu Zhiyuan
6: Kuan-Yu Chen Kuan-Yu
7: Eric D., Feigelson Eric D.
8: Alan O Callaghan Alan O
9: Yang Liu Yang
10: Alexander Pastukhov Alexander
11: Sy Han Chiou Sy Han
12: Xi LUO Xi
13: Ting Fung Ma Ting Fung
14: Xi Luo Xi
15: Mark O Connell Mark O
16: Antonio D Ambrosio Antonio D
17: Kropko, Jonathan Kropko
18: Yanwei Zhang Yanwei
19: Simon, Reinhard Simon
20: Xiaorui Zhu Xiaorui
21: Garcia-Rodenas, Alvaro Garcia-Rodenas
22: Brian P. O Connor Brian P. O
23: Mitchell O Hara-Wild Mitchell O
24: Pedro H. C. Sant Anna Pedro H. C. Sant
25: Seunggeun Lee Seunggeun
26: Hsiang Hao, Chen Hsiang Hao
27: Joshua O Brien Joshua O
28: Xiaofei Wang Xiaofei
29: Xuanhua Yin Xuanhua
30: Hui Tang Hui
31: Michael O Neill Michael O
32: Markus Loecher, Berlin School of Economics and Law Markus Loecher
33: Na Li Na
34: Antoine Tremblay, Dalhousie University Antoine Tremblay
35: ORPHANED <NA>
36: Christoph Hoeppke Christoph Hoeppke
37: Jared O Connell Jared O
38: Tao Liu, PhD Tao Liu
39: Eoin O Connell Eoin O
40: Lauren O Brien Lauren O
41: Silvia D Angelo Silvia D
42: Suhai Liu Suhai
43: Ming-Chang Lee Ming-Chang
44: Patrick O Keefe Patrick O
45: Nezami,Hossein Nezami
46: Varhegyi, Nikole E. Varhegyi
47: Jose Gama Jose
48: Tim Triche, Jr. Tim Triche
49: Xiuquan Wang Xiuquan
50: Shawn T. O Neil Shawn T. O
51: Matteo Dell Omodarme Matteo Dell
52: Lorenzo D Andrea Lorenzo D
53: Eamon O Dea Eamon O
54: Nepomechie, David Israel Nepomechie
55: Marcello D Orazio Marcello D
56: Lucy D Agostino McGowan Lucy D
57: James P. Howard, II James P. Howard
58: Robert Myles McDonnell Robert Myles McDonnell
59: Zekun Xu Zekun
nc.maint orig.maint
Some issues to fix (exercise for the reader)
- nc produced Nezami,Hossein which needs a space after the comma.
- orig method produced Robert Myles McDonnell which has an extra space.
- nc produced both Xi LUO and Xi Luo, which are the same person but different text strings.
- nc produces ORPHANED whereas the original method produced missing values, NA. Not to be confused with the name Na (orig) or Na Li (nc).
Where do I appear in the ranking? In the top 50, out of almost 9000 package maintainers! That is the top 1%.
> nc.sorted[, Rank := rank(-count)]
> (N.maintainers <- nrow(nc.sorted))
[1] 8795
> nc.sorted[, percent := 100*Rank/N.maintainers]
> nc.sorted[orig.maint=="Toby Dylan Hocking"]
orig.maint count rank Rank percent
1: Toby Dylan Hocking 14 50 50 0.5685048
Finally, how many packages do you need to get into the top x%?
> unique(nc.sorted[, .(count, Rank, percent)])
count Rank percent
1: 80 1.0 0.01137010
2: 61 2.0 0.02274019
3: 56 3.0 0.03411029
4: 47 4.0 0.04548039
5: 46 5.0 0.05685048
6: 33 6.0 0.06822058
7: 31 7.5 0.08527572
8: 30 9.5 0.10801592
9: 29 11.0 0.12507106
10: 28 12.5 0.14212621
11: 26 14.0 0.15918135
12: 25 17.0 0.19329164
13: 24 21.0 0.23877203
14: 23 23.0 0.26151222
15: 22 24.5 0.27856737
16: 21 26.0 0.29562251
17: 20 27.5 0.31267766
18: 19 30.5 0.34678795
19: 18 35.5 0.40363843
20: 16 40.0 0.45480387
21: 15 44.0 0.50028425
22: 14 50.0 0.56850483
23: 13 58.5 0.66515065
24: 12 69.5 0.79022172
25: 11 87.5 0.99488346
26: 10 111.5 1.26776578
27: 9 137.5 1.56338829
28: 8 177.0 2.01250711
29: 7 237.5 2.70039795
30: 6 319.0 3.62706083
31: 5 433.0 4.92325185
32: 4 646.0 7.34508243
33: 3 1086.0 12.34792496
34: 2 2137.0 24.29789653
35: 1 5844.5 66.45252985
count Rank percent
So you need 11 packages to be in the top 1%, 4 packages to be in the top 10%, etc.
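The fractional Rank values in the table above (7.5, 9.5, and so on) come from the default tie handling of rank, which assigns tied maintainers the average of the ranks they occupy. Here is a small base R sketch with made-up counts (not the real CRAN data) that compares the default with ties.method="min".

```r
# Hypothetical package counts for six maintainers, with two ties.
count <- c(80, 61, 31, 31, 30, 30)
# Default: tied values share the average of their ranks.
avg.rank <- rank(-count)
# Alternative: tied values all get the smallest rank in their group.
min.rank <- rank(-count, ties.method="min")
# Percentile computed as in the text above.
percent <- 100 * avg.rank / length(count)
print(data.frame(count, avg.rank, min.rank, percent))
```

Which tie method is preferable depends on the question: the average keeps the sum of ranks the same as without ties, whereas "min" answers "how many maintainers have strictly more packages, plus one".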