Collaborations not allowed
The goal of this post is to explore which countries and institutions researchers in Canada are not allowed to collaborate with.
Introduction
When applying for grants at my new job at Université de Sherbrooke in Canada, I was asked to agree not to share research “secrets” with the organizations on a certain list, published on a web page. This is a silly thing to ask of most researchers, who publish in peer-reviewed journals and conferences that are more or less publicly accessible to anyone. If anyone can access my research, how can I prevent it from being shared with these organizations? This makes me think of the McCarthy-era communist witch hunt in the USA in the 1950s. Are we soon going to have to swear loyalty, or renounce a certain ideology, to keep our jobs at the university? I find this dubious.
Simple example
A good example of what the data look like is below:
some.html <- '
<li><strong>A.A. Kharkevich Institute for Information Transmission Problems, IITP, Russian Academy of Sciences</strong> (Russia)</li>
<li><strong>Academy of Military Medical Sciences</strong> (People’s Republic of China)
<p>Known alias(es): AMMS</p>
</li>
<li><strong>Academy of Military Science</strong> (People’s Republic of China)
<p>Known alias(es): AMS</p>
</li>
<li><strong>Aerospace Research Institute</strong> (Iran)
<p>Known alias(es): ARI</p>
</li>
'
We parse the name in each block using the regex below.
name.pattern <- list(
"\n\t<li><strong>",
name=".*?",
"</strong>")
nc::capture_all_str(
some.html,
name.pattern)
## name
## <char>
## 1: A.A. Kharkevich Institute for Information Transmission Problems, IITP, Russian Academy of Sciences
## 2: Academy of Military Medical Sciences
## 3: Academy of Military Science
## 4: Aerospace Research Institute
The table above has four rows and one column. We can add another column for country via the code below.
country.pattern <- list(
" [(]",
country=".*?",
" *[)]")
nc::capture_all_str(
some.html,
name.pattern,
country.pattern)
## name
## <char>
## 1: A.A. Kharkevich Institute for Information Transmission Problems, IITP, Russian Academy of Sciences
## 2: Academy of Military Medical Sciences
## 3: Academy of Military Science
## 4: Aerospace Research Institute
## country
## <char>
## 1: Russia
## 2: People’s Republic of China
## 3: People’s Republic of China
## 4: Iran
The table above has two columns. To add a final column for aliases, we use the code below.
aliases.pattern <- list(
"\n\t<p>Known alias[(]es[)]: ",
aliases=".*?",
"</p>")
nca.pattern <- list(
name.pattern,
country.pattern,
nc::quantifier(aliases.pattern, "?"))
nc::capture_all_str(some.html, nca.pattern)
## name
## <char>
## 1: A.A. Kharkevich Institute for Information Transmission Problems, IITP, Russian Academy of Sciences
## 2: Academy of Military Medical Sciences
## 3: Academy of Military Science
## 4: Aerospace Research Institute
## country aliases
## <char> <char>
## 1: Russia
## 2: People’s Republic of China AMMS
## 3: People’s Republic of China AMS
## 4: Iran ARI
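The optional alias group can also be checked without nc. The sketch below is a base R cross-check, not the method used in this post. It assumes each list item is already in its own string, and the `items` vector and the organization names (`Org A`, `Org B`) are made up for illustration.

```r
# Hypothetical list items: the first has no alias paragraph, the second has one.
items <- c(
  "<li><strong>Org A</strong> (Russia)</li>",
  "<li><strong>Org B</strong> (Iran)\n\t<p>Known alias(es): OB; Org Bee</p>\n</li>")
# Same alias regex as above, minus the leading newline and tab.
alias.regex <- "<p>Known alias[(]es[)]: (.*?)</p>"
has.alias <- grepl(alias.regex, items, perl=TRUE)
# (?s) lets . match newlines, so the whole item is replaced by the capture.
alias.text <- sub(
  paste0("(?s).*", alias.regex, ".*"), "\\1", items, perl=TRUE)
# Items without an alias get an empty string, as in the nc output.
aliases <- ifelse(has.alias, alias.text, "")
print(aliases)
```

In the nc version, `nc::quantifier(aliases.pattern, "?")` plays a similar role to `has.alias` here: when the alias paragraph is absent, the `aliases` column is an empty string.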
Download and parse
orgs.html <- "../assets/2024-09-09-canada-list-of-risky-research-orgs.html"
if(!file.exists(orgs.html)){
  u <- "https://science.gc.ca/site/science/en/safeguarding-your-research/guidelines-and-tools-implement-research-security/sensitive-technology-research-and-affiliations-concern/named-research-organizations"
  download.file(u, orgs.html)
}
orgs.lines <- readLines(orgs.html)
(n.strong <- length(grep("<strong>",orgs.lines)))
## [1] 103
orgs.dt <- nc::capture_all_str(orgs.lines, nca.pattern)
nrow(orgs.dt)
## [1] 103
The number of rows above seems to agree with the number of <strong> tags (simpler pattern).
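One caveat about this kind of sanity check: `length(grep(...))` counts lines that contain the tag, not the tags themselves, so the two counts only agree if no line holds more than one tag. A minimal base R sketch of the difference, using made-up lines (the `lines` vector is hypothetical, not taken from the downloaded page):

```r
# Hypothetical lines: the second one contains two <strong> tags.
lines <- c(
  "<li><strong>A</strong> (X)</li>",
  "<li><strong>B</strong> and <strong>C</strong></li>",
  "<p>no tag here</p>")
# Number of lines with at least one tag.
n.lines <- length(grep("<strong>", lines, fixed=TRUE))
# Number of tags, counting every occurrence on every line.
tag.matches <- regmatches(lines, gregexpr("<strong>", lines, fixed=TRUE))
n.tags <- sum(lengths(tag.matches))
print(c(n.lines=n.lines, n.tags=n.tags))
```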
Below we check the number of aliases.
sum(orgs.dt$aliases!="")
## [1] 75
aliases.lines <- grep("alias(es)", orgs.lines, fixed=TRUE, value=TRUE)
length(aliases.lines)
## [1] 94
The two counts above differ (75 parsed aliases versus 94 lines containing an alias), so 19 aliases were not parsed. Which ones?
alias.dt <- nc::capture_first_vec(paste0("\n",aliases.lines), aliases.pattern)
alias.dt[!orgs.dt,.(aliases40=substr(aliases,1,40)),on="aliases"]
## aliases40
## <char>
## 1: BMSU; Bagiatollah Medical Sciences Unive
## 2: HPSTAR; Beijing High Voltage Science Res
## 3: PLA Dalian Naval Academy
## 4: PAP Engineering University
## 5: HEU
## 6: Imam Hussein University; IHU; Imam Hosse
## 7: Institute of Cadre Management; Institute
## 8: KLISE
## 9: PAP Logistics University
## 10: Forensic Identification Center of the Mi
## 11: PLA Nanjing Army Command College
## 12: PAP Officers' College; People's Armed Po
## 13: People's Armed Police NCO College
## 14: Ministry of Public Security Railway Poli
## 15: SBU; Martyr Baheshti University; Univers
## 16: TJU
## 17: UESTC
## 18: XATU
## 19: 27th NTs
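The `alias.dt[!orgs.dt, on="aliases"]` call above is a data.table anti-join: it keeps the rows of `alias.dt` whose `aliases` value has no match in `orgs.dt`. The same idea can be sketched in base R with `%in%`. The two vectors below are hypothetical stand-ins for the alias columns of the two tables.

```r
# Hypothetical alias strings: every alias line found by grep.
all.aliases <- c("AMMS", "HEU", "AMS", "TJU")
# Hypothetical alias strings that the regex managed to parse.
parsed.aliases <- c("AMMS", "AMS")
# Keep the aliases which do not appear among the parsed ones.
not.parsed <- all.aliases[!all.aliases %in% parsed.aliases]
print(not.parsed)
```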
Trying again
Looking at the ones that did not match, it seems that some list items have one or more empty lines before the alias paragraph, so the pattern needs to allow for these optional empty lines.
odd.html <- '
<li><strong>Center for High Pressure Science and Technology Advanced Research</strong> (People’s Republic of China)
<p>Known alias(es): HPSTAR; Beijing High Voltage Science Research Center</p>
</li>
<li><strong>Engineering University of the CAPF</strong> (People’s Republic of China)
<p>Known alias(es): PAP Engineering University</p>
</li>
<li><strong>Explosion and Impact Technology Research Centre</strong> (Iran)
<p>Known alias(es): Research Centre for Explosion and Impact; METFAZ</p>
</li>
<li><strong>Institute of NBC Defense</strong> (People’s Republic of China)</li>
'
aliases.plus.pattern <- list(
"\n+\t<p>Known alias[(]es[)]: ", #added a +
aliases=".*?",
"</p>")
nca.plus.pattern <- list(
name.pattern,
country.pattern,
nc::quantifier(aliases.plus.pattern, "?"))
nc::capture_all_str(some.html, nca.plus.pattern)
## name
## <char>
## 1: A.A. Kharkevich Institute for Information Transmission Problems, IITP, Russian Academy of Sciences
## 2: Academy of Military Medical Sciences
## 3: Academy of Military Science
## 4: Aerospace Research Institute
## country aliases
## <char> <char>
## 1: Russia
## 2: People’s Republic of China AMMS
## 3: People’s Republic of China AMS
## 4: Iran ARI
nc::capture_all_str(odd.html, nca.plus.pattern)
## name country
## <char> <char>
## 1: Center for High Pressure Science and Technology Advanced Research People’s Republic of China
## 2: Engineering University of the CAPF People’s Republic of China
## 3: Explosion and Impact Technology Research Centre Iran
## 4: Institute of NBC Defense People’s Republic of China
## aliases
## <char>
## 1: HPSTAR; Beijing High Voltage Science Research Center
## 2: PAP Engineering University
## 3: Research Centre for Explosion and Impact; METFAZ
## 4:
Results above look great! Let’s try it again on the full file below.
plus.dt <- nc::capture_all_str(orgs.lines, nca.plus.pattern)
nrow(plus.dt)
## [1] 103
n.strong
## [1] 103
sum(plus.dt$aliases!="")
## [1] 94
length(aliases.lines)
## [1] 94
Numbers agreeing in the output above indicate that the data were parsed correctly.
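As a small illustration of why the added `+` was needed, the base R sketch below tests both versions of the pattern on a single made-up item (`Org C` is hypothetical) that has an empty line between the country and the alias paragraph. The closing parenthesis of the country is included in the regex, because the problem only appears when the alias paragraph must come right after it.

```r
# Hypothetical item with an empty line before the alias paragraph.
item <- "<li><strong>Org C</strong> (Iran)\n\n\t<p>Known alias(es): OC</p>\n</li>"
# Exactly one newline after the closing parenthesis: no match.
one.newline <- grepl("[)]\n\t<p>Known alias", item)
# One or more newlines after the closing parenthesis: match.
some.newlines <- grepl("[)]\n+\t<p>Known alias", item)
print(c(one.newline=one.newline, some.newlines=some.newlines))
```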
Country analysis
plus.dt[, .(organizations=.N), by=country]
## country organizations
## <char> <int>
## 1: Russia 6
## 2: People’s Republic of China 85
## 3: Iran 12
The output above indicates there are just three countries with organizations to avoid.
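For readers less familiar with data.table, the `.N` by `country` computation above can be sketched in base R with `table()`. The `country` vector below is a made-up stand-in for `plus.dt$country`.

```r
# Hypothetical country column, standing in for plus.dt$country.
country <- c("Russia", "Iran", "Iran", "Russia", "Iran")
# table() counts the occurrences of each distinct value.
counts <- table(country)
# Convert to a data frame with the same column names as above.
count.df <- data.frame(
  country=names(counts),
  organizations=as.integer(counts),
  stringsAsFactors=FALSE)
print(count.df)
```

One difference: `table()` sorts the values, whereas data.table `by` keeps groups in order of first appearance.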
Aliases
How many aliases per organization?
(a.dt <- plus.dt[
, alias.list := strsplit(aliases,split="; ")
][
, .(alias=alias.list[[1]]), by=name
])
## name alias
## <char> <char>
## 1: Academy of Military Medical Sciences AMMS
## 2: Academy of Military Science AMS
## 3: Aerospace Research Institute ARI
## 4: Air Force Medical University Air Force Medical University
## 5: Air Force Medical University Air Force Military Medical University
## ---
## 250: 48th Central Scientific Research Institute Military Technical Scientific Research Institute
## 251: 48th Central Scientific Research Institute Center for Military Technical Problems of Biological Defense
## 252: 48th Central Scientific Research Institute 48th TsNII Kirov
## 253: 48th Central Scientific Research Institute Scientific Research Institute of Microbiology
## 254: 48th Central Scientific Research Institute Scientific Research Institute of Epidemiology and Hygiene
tibble::tibble(plus.dt[
, n.alias := sapply(alias.list, length)
]) # for nice print.
## # A tibble: 103 × 5
## name country aliases alias.list n.alias
## <chr> <chr> <chr> <list> <int>
## 1 A.A. Kharkevich Institute for Information Transmission Problems, IITP, Russian Ac… Russia "" <chr [0]> 0
## 2 Academy of Military Medical Sciences People… "AMMS" <chr [1]> 1
## 3 Academy of Military Science People… "AMS" <chr [1]> 1
## 4 Aerospace Research Institute Iran "ARI" <chr [1]> 1
## 5 Air Force Medical University People… "Air F… <chr [4]> 4
## 6 Air Force Research Institute People… "Air F… <chr [2]> 2
## 7 Air Force Xi’an Flight Academy People… "PLA A… <chr [3]> 3
## 8 Airforce Command College People… "PLA A… <chr [4]> 4
## 9 Airforce Communication NCO Academy People… "Dalia… <chr [1]> 1
## 10 Airforce Early Warning Academy People… "Wuhan… <chr [1]> 1
## # ℹ 93 more rows
## another way to get the count is via .N in join:
a.dt[plus.dt, .(.N=.N, n.alias), on='name', by=.EACHI]
## name .N n.alias
## <char> <int> <int>
## 1: A.A. Kharkevich Institute for Information Transmission Problems, IITP, Russian Academy of Sciences 0 0
## 2: Academy of Military Medical Sciences 1 1
## 3: Academy of Military Science 1 1
## 4: Aerospace Research Institute 1 1
## 5: Air Force Medical University 4 4
## ---
## 99: Xi'an Technological University 1 1
## 100: 27th Scientific Center of the Russian Ministry of Defense 1 1
## 101: 33rd Scientific Research and Testing Institute 1 1
## 102: 46th TSNII Central Scientific Research Institute 2 2
## 103: 48th Central Scientific Research Institute 10 10
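The alias counts rely on how `strsplit` treats an empty string: it returns a zero-length character vector, which is why organizations without aliases get a count of 0 rather than 1. A minimal base R sketch with made-up alias strings (`A1; A2; A3` is hypothetical):

```r
# Hypothetical alias strings: none, one, and three aliases.
aliases <- c("", "AMMS", "A1; A2; A3")
# Split on the same separator as above.
alias.list <- strsplit(aliases, split="; ")
# lengths() is a base R shortcut for sapply(alias.list, length).
n.alias <- lengths(alias.list)
print(n.alias)
```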
Conclusions
We used regular expressions to parse the list of named research organizations, which shows that Canada does not want researchers collaborating with certain organizations in three countries: Russia, Iran, and the People’s Republic of China.
Session info
sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C LC_TIME=English_United States.utf8
##
## time zone: America/Toronto
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics utils datasets grDevices methods base
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.4 xfun_0.47 nc_2024.9.20 magrittr_2.0.3 glue_1.7.0 tibble_3.2.1
## [7] knitr_1.48 pkgconfig_2.0.3 lifecycle_1.0.4 cli_3.6.3 fansi_1.0.6 vctrs_0.6.5
## [13] data.table_1.16.99 compiler_4.4.1 tools_4.4.1 evaluate_0.24.0 pillar_1.9.0 rlang_1.1.4