7  Wrangling bibliometrics data

WORK IN PROGRESS

Summary

The bibliometrics team have a spreadsheet detailing published journal articles from which they would like to understand, “Which UoS department is affiliated to each article?”

Here we go through the process of wrangling the data to answer this question and produce some summary statistics.

It’s time to apply our knowledge to some bibliometrics data. Before we begin, I’ll assume we’ve got the tidyverse loaded into our environment in RStudio.

7.1 Describing the problem

I’ll attempt to describe the problem as I understand it, and then break it down into parts that correspond with Chapter 3 , Chapter 5 and Chapter 6.

Here’s an excerpt from an email Kate sent me on 2023-11-21:

[A spreadsheet] is produced from a fortnightly check… for REF and Funder Compliance… the Export Control Team periodically asks us for information about particular staff/institutions/countries and…trying to search in the Affiliations can be very time consuming…

From what I can see, the overarching problem is this:

  1. There are three data sources: Scopus, Web of Science and Pure, from which tables about publications are regularly downloaded:

    All three tables contain DOIs for the publications, but only Scopus has the UoS department affiliations, and Pure contains the Open Access compliance information

  2. The data contained in these three tables needs to be joined, and then joined to other data.

  3. The final output is a table indicating the open access status, funding body compliance status and University department affiliation for each publication.

Having spent a bit of time looking at the different data, I can see this is a problem that requires a mixture of several dpylr and tidyr functions. I can also see that this problem cannot be entirely automated as some of the data needed is not in the three tables.

To join tables we need common variables between tables: these will be the publication title and the digital object identifier (DOI), but these will have different names so we need to account for that.

7.2 Downloading the bibliometrics data

Here I have three example datasets and identify the common variables by which we might join them.

7.2.1 Pure data

I don’t have Admin access to Pure so Kate had to send me some data which I saved as a csv file for us to use, but the process once logged into Pure is:

  • Go to report definitions on the side menu, click on Managed report definitions – Add as favourite the compliance reporting link.

  • Click on [ComplianceReporting] Articles/Conferences Open Access Data for Current + Previous Year. It will open in a new window.

  • Click on XLS at the bottom of the window to create the report. The report will be generated. Click on Download XLS.

  • Click on the downloaded file and save to your folder.

I then opened the file in Excel and saved it as a csv so that we can load the example Pure data directly from the repository of this book.

The table has 5,186 rows and 21 columns:

  • The DOI is in Electronic version(s) of this work > DOI (Digital Object Identifier)-0
  • The publication title is Title of the contribution in original language-5
# Load the Pure csv file
pure_dat <- read_csv("https://github.com/ab604/library-r/raw/main/data/pure-2024-03-26.csv")

# Inspect the contents
glimpse(pure_dat)
Rows: 5,186
Columns: 21
$ `Electronic version(s) of this work > DOI (Digital Object Identifier)-0` <chr> …
$ `Additional source IDs-1`                                                <chr> …
$ `UUID-2`                                                                 <chr> …
$ `ID-3`                                                                   <dbl> …
$ `Managing organisational unit-4`                                         <chr> …
$ `Title of the contribution in original language-5`                       <chr> …
$ `Subtitle of the contribution in original language-6`                    <chr> …
$ `Publication statuses and dates > E-pub ahead of print-7`                <chr> …
$ `Publication statuses and dates > Accepted/In press-8`                   <chr> …
$ `Workflow > Step-9`                                                      <chr> …
$ `Electronic version(s) of this work > Document version-10`               <chr> …
$ `Electronic version(s) of this work > Licence to document-11`            <chr> …
$ `Managing organisational unit-12`                                        <chr> …
$ `Open Access status-13`                                                  <chr> …
$ `REF Open Access Compliance Status-14`                                   <chr> …
$ `REF2021 exception-15`                                                   <chr> …
$ `Current publication status > Date-16`                                   <dbl> …
$ `Journal > Journal-17`                                                   <chr> …
$ `Event-18`                                                               <chr> …
$ `Title of the host publication in original language-19`                  <chr> …
$ `Type-20`                                                                <chr> …

7.2.2 Scopus data

I downloaded some data as a csv file from Scopus Advanced Search to generate a report for 31 days prior to 2024-03-13 using the following query string:

AF-ID("University of Southampton" 60025225) AND PUBYEAR > 2017 AND PUBYEAR < 2025 AND ( LIMIT-TO ( DOCTYPE,"ar" ) OR LIMIT-TO ( DOCTYPE,"re" ) OR LIMIT-TO ( DOCTYPE,"cp" ) OR LIMIT-TO ( DOCTYPE,"ed" ) OR LIMIT-TO ( DOCTYPE,"le" ) ) AND RECENT(31)

As before, I saved it so that we can load the example Scopus data directly from the repository of this book.

The table has 483 rows and 20 columns:

  • The title of each publication is in the Title column.
  • The DOI is in the DOI column
  • Affiliations contains the all institution departments, separated by semi-colons. Hence we have multiple values in each cell.
# Load the scopus csv file
scopus_dat <- read_csv("https://github.com/ab604/library-r/raw/main/data/scopus-2024-03-13.csv")

# Inspect the contents
glimpse(scopus_dat)
Rows: 483
Columns: 20
$ Authors                     <chr> "Liu Z.; Ettabib M.A.; Bowden B.M.; Bartle…
$ `Author full names`         <chr> "Liu, Zhen (57220604023); Ettabib, Mohamed…
$ `Author(s) ID`              <chr> "57220604023; 36662229200; 57225044665; 15…
$ Title                       <chr> "Multiframe-based non-local means denoisin…
$ Year                        <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, …
$ `Source title`              <chr> "Spectrochimica Acta - Part A: Molecular a…
$ DOI                         <chr> "10.1016/j.saa.2024.123931", "10.1113/JP28…
$ Link                        <chr> "https://www.scopus.com/inward/record.uri?…
$ Affiliations                <chr> "Zepler Institute for Photonics and Nanoel…
$ `Authors with affiliations` <chr> "Liu Z., Zepler Institute for Photonics an…
$ `Funding Details`           <chr> "Defence Science and Technology Laboratory…
$ `Correspondence Address`    <chr> "Z. Liu; Zepler Institute for Photonics an…
$ Publisher                   <chr> "Elsevier B.V.", "John Wiley and Sons Inc"…
$ ISSN                        <chr> "13861425", "00223751", "00298018", "00489…
$ ISBN                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ CODEN                       <chr> "SAMCA", "JPHYA", NA, "STEVA", NA, NA, "EN…
$ `Document Type`             <chr> "Article", "Article", "Article", "Article"…
$ `Open Access`               <chr> NA, "All Open Access; Hybrid Gold Open Acc…
$ Source                      <chr> "Scopus", "Scopus", "Scopus", "Scopus", "S…
$ EID                         <chr> "2-s2.0-85183956855", "2-s2.0-85186478720"…

7.2.3 Web of Science data

I followed these steps to download Web of Science (WoS) data on 2024-03-26 as a tsv file.

On Web of Science:

  • Under documents select AFFILIATIONS and UNIVERSITY OF SOUTHAMPTON (Note that the computer may remember the last search).

  • Click on Add date range – select Publication date down arrow and select Custom. Enter date range – last date of report drawn and current date eg. 2022-04-04 to 2022-04-25 and click search.

  • Limit the results to 2018 onwards.

  • Click on Document Types and select the following categories and click on Refine.

    1. Articles

    2. Early access

    3. Review articles

    4. Editorial materials

    5. Proceeding papers

    6. Letter

  • Click EXPORT – select Tab delimited file. New window will open. Select All records. Under Record Content - select Full Record. Click Export.

As before, I saved it so that we can load the example WoS data directly from the repository of this book.

It has 50 rows with 71 columns all with two letter field tags

  • The title of each publication is in the TI column.
  • The DOI is in the DI column
  • C1 is the field tag for Author address which also contains institution department information, but as with the Scopus data it contains multiple values in each cell.
# Load the Web of science tsv file
wos_dat <- read_tsv("https://github.com/ab604/library-r/raw/main/data/web-of-science-2024-03-26.tsv")

# Inspect at the contents
glimpse(wos_dat)
Rows: 50
Columns: 71
$ PT <chr> "J", "J", "J", "J", "J", "J", "J", "J", "J", "J", "J", "J", "J", "J…
$ AU <chr> "Saunders, B", "Messiou, K; de los Reyes, J; Potnis, C; Dong, P; Rw…
$ BA <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ BE <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ GP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ AF <chr> "Saunders, Ben", "Messiou, Kyriaki; de los Reyes, Jay; Potnis, Chin…
$ BF <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ CA <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ TI <chr> "Fetuses are not adult humans: a response to Miller on abortion", "…
$ SO <chr> "JOURNAL OF MEDICAL ETHICS", "INTERNATIONAL JOURNAL OF INCLUSIVE ED…
$ SE <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ BS <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ LA <chr> "English", "English", "English", "English", "English", "English", "…
$ DT <chr> "Editorial Material; Early Access", "Article; Early Access", "Artic…
$ CT <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ CY <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ CL <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ SP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ HO <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ DE <chr> "Abortion - Induced; Ethics; Fetus; Moral Status; Personhood", "Stu…
$ ID <chr> NA, "EDUCATION; POWER", "ACHIEVEMENT INEQUALITY; EDUCATIONAL-INEQUA…
$ AB <chr> "Miller has recently argued that fetuses have the same inherent val…
$ C1 <chr> "[Saunders, Ben] Univ Southampton, Polit & IR, Southampton, England…
$ C3 <chr> "University of Southampton", "University of Southampton", "Universi…
$ RP <chr> "Saunders, B (corresponding author), Univ Southampton, Polit & IR, …
$ EM <chr> "b.m.saunders@soton.ac.uk", "k.messiou@soton.ac.uk", "n.pensiero@so…
$ RI <chr> NA, NA, NA, NA, NA, NA, NA, "; gardes, frederic/C-1619-2013; Gow, P…
$ OI <chr> "Saunders, Ben/0000-0002-5147-6397", "Messiou, Kyriaki/0000-0003-34…
$ FU <chr> NA, "UKRI10.13039/100014013", NA, NA, NA, "Engineering and Physical…
$ FP <chr> NA, "UKRI10.13039/100014013", NA, NA, NA, "Engineering and Physical…
$ FX <chr> NA, "We would like to thank all children and teachers who took part…
$ CR <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ NR <dbl> 7, 50, 73, 82, 3, 60, 35, 11, 0, 47, 102, 6, 101, 2, 15, 76, 40, 61…
$ TC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ Z9 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ U1 <dbl> 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 1, 2, 1, 0, 0…
$ U2 <dbl> 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 1, 2, 1, 0, 0…
$ PU <chr> "BMJ PUBLISHING GROUP", "ROUTLEDGE JOURNALS, TAYLOR & FRANCIS LTD",…
$ PI <chr> "LONDON", "ABINGDON", "ABINGDON", "LONDON", "HOBOKEN", "WASHINGTON"…
$ PA <chr> "BRITISH MED ASSOC HOUSE, TAVISTOCK SQUARE, LONDON WC1H 9JR, ENGLAN…
$ SN <chr> "0306-6800", "1360-3116", "1747-7778", "1468-215X", "0007-1315", "2…
$ EI <chr> "1473-4257", "1464-5173", "1747-7786", "1473-4265", "1468-4446", NA…
$ BN <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ J9 <chr> "J MED ETHICS", "INT J INCLUSIVE EDUC", "J SIMUL", "MED HUMANIT", "…
$ JI <chr> "J. Med. Ethics", "Int. J. Incl. Educ.", "J. Simul.", "Med. Humanit…
$ PD <chr> "2024 FEB 23", "2024 FEB 20", "2024 MAR 10", "2024 FEB 15", "2024 F…
$ PY <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2024, 2…
$ VL <dbl> NA, NA, NA, NA, NA, NA, NA, 60, NA, NA, NA, 12, 6, NA, 8, NA, NA, 6…
$ IS <dbl> NA, NA, NA, NA, NA, NA, NA, 5, NA, NA, 2, 1, NA, NA, 3, NA, NA, 3, …
$ PN <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ SU <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ SI <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ MA <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ BP <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ EP <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ AR <chr> NA, NA, NA, NA, NA, NA, NA, "e13138", NA, NA, "90", "010201", "1245…
$ DI <chr> "10.1136/jme-2024-109854", "10.1080/13603116.2024.2317729", "10.108…
$ DL <chr> "http://dx.doi.org/10.1136/jme-2024-109854", "http://dx.doi.org/10.…
$ D2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ EA <chr> "FEB 2024", "FEB 2024", "MAR 2024", "FEB 2024", "FEB 2024", "MAR 20…
$ PG <dbl> 2, 15, 15, 10, 3, 10, 25, 3, 2, 19, 52, 2, 13, 1, 4, 21, 13, 22, 10…
$ WC <chr> "Ethics; Medical Ethics; Social Issues; Social Sciences, Biomedical…
$ WE <chr> "Science Citation Index Expanded (SCI-EXPANDED); Social Science Cit…
$ SC <chr> "Social Sciences - Other Topics; Medical Ethics; Social Issues; Bio…
$ GA <chr> "JS7M6", "JA7K3", "LN5N2", "ID4Y4", "HW8F9", "KP6M7", "KP2U6", "KJ8…
$ PM <dbl> 38395622, NA, NA, 38360797, NA, NA, NA, NA, NA, NA, NA, NA, NA, 384…
$ OA <chr> NA, "hybrid", "hybrid", NA, NA, "hybrid", "hybrid, Green Submitted"…
$ HC <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ HP <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ DA <date> 2024-03-26, 2024-03-26, 2024-03-26, 2024-03-26, 2024-03-26, 2024-0…
$ UT <chr> "WOS:001175218400001", "WOS:001170495400001", "WOS:001187496600001"…

7.3 Make it simple

To make it simple - well simpler! - lets use a single entry from each table to understand how we might tackle the whole data set.

Manual inspection tells me that all three tables have a publication entitled, Lipid droplets in steatotic liver disease with a DOI of 10.1097/MCO.0000000000000993.

Using filter I extract the corresponding row from each table to inspect them.

# Pure title variable is `Title of the contribution in original language-5`
pure_lipid <- pure_dat |> 
  filter(`Title of the contribution in original language-5` == 
           "Lipid droplets in steatotic liver disease") 
#  Scopus title variable is Title
scopus_lipid <- scopus_dat |> 
  filter(Title == "Lipid droplets in steatotic liver disease")
# WoS title variable is TI
wos_lipid <- wos_dat |> 
  filter(TI == "Lipid droplets in steatotic liver disease")

# Print number of rows and columns with dim_desc()
dim_desc(pure_lipid)
[1] "[1 x 21]"
dim_desc(scopus_lipid)
[1] "[1 x 20]"
dim_desc(wos_lipid)
[1] "[1 x 71]"

7.3.1 Joining the small table

We have a choice about whether to join the tables first or deal with the problem of having multiple values in the Affiliation and C1 cells first?

Let’s do the joins first. This is an easier problem, but also as joining also filters, it reduces the number of rows and therefore when it comes to the full data, it reduces the amount of departments we need to tidy.

We’ll use our single row pure_lipid object to perform filtering joins on the full scopus_dat and wos_dat data using left_join() as per Section 6.3.

As the variables for title and DOI have different names we have to explicitly tell the join function how to map these variables using by and passing character vectors stating the mapping as below. I’m using two pipes and doing both joins prior to assigning a new object lipid_join.

We start with 1 row and 21 variables in pure_lipid and join on two variables, meaning we add 18 variables from scopus_dat and 69 variables from wos_dat to create a table with 1 row and 21 + 18 + 69 = 108 columns.

lipid_join <- pure_lipid |> 
  left_join(scopus_dat, by = c(`Title of the contribution in original language-5` = "Title",
                               `Electronic version(s) of this work > DOI (Digital Object Identifier)-0` = "DOI")) |> 
  left_join(wos_dat, by = c(`Title of the contribution in original language-5` = "TI" , 
                           `Electronic version(s) of this work > DOI (Digital Object Identifier)-0` = "DI"))

# Print number of rows and columns with dim_desc()
dim_desc(lipid_join)
[1] "[1 x 108]"

7.3.2 Separating the Scopus Affiliations

I’ll use select to look at Affiliations column in lipid_join that originates in the Scopus data, and we can see there are three ; separating three values for departments: two are University of Southampton and one is from the University of Pennsylvania.

Listing 7.1: Use select to see how many values are in the Affiliations cell
lipid_join |> 
  select(Affiliations) |> 
  flextable() |> 
  font(part = "all", fontname = "Consolas") |>
  bg(bg = "white", part = "all") |> 
  autofit()

Affiliations

School of Human Development and Health, Faculty of Medicine, University of Southampton, United Kingdom; National Institute for Health Research, Southampton Biomedical Research Centre, University of Southampton, University Hospital Southampton, National Health Service Foundation Trust, Southampton, United Kingdom; Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States

We can use separate_wider_delim() to split these three departments on the ;.

We will provide the following arguments to the function:

  • cols = Affiliations this tells the function which column to separate into individual variables.
  • delim = ";" this tells the function to split on the semi-colon. Note we supply a string.
  • names_sep = "_" this automatically creates new names based on the original variable, Affiliations separated by _ and numbered sequentially.
  • too_few = "align_start" this we don’t need now, but when we have different numbers of departments in each row, it will number them from 1 and create NA cells if there are empty cells.

I’m not assigning an object as I want to see how this looks.

Listing 7.2: Separating the Affiliation values in lipid_join into individual columns using separate_wider_delim()
lipid_join |> 
  select(Affiliations) |> 
  separate_wider_delim(cols = Affiliations,
                       names_sep = "_",
                       delim = ";") |> 
  flextable() |> 
  font(part = "all", fontname = "Consolas") |>
  bg(bg = "white", part = "all") |> 
  autofit()

Affiliations_1

Affiliations_2

Affiliations_3

School of Human Development and Health, Faculty of Medicine, University of Southampton, United Kingdom

National Institute for Health Research, Southampton Biomedical Research Centre, University of Southampton, University Hospital Southampton, National Health Service Foundation Trust, Southampton, United Kingdom

Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States

That seems to work, so lets assign the output to a new object lipid_wide. We expect to add two new columns, lipid_join has 108 columns, we expect lipid_wide to have 110 columns.

Listing 7.3: Separating the Affiliation values in scopus_dat into individual columns using separate_wider_delim() and creating a new object.
lipid_wide <- lipid_join |> 
  separate_wider_delim(cols = Affiliations, 
                       delim = ";", 
                       names_sep = "_", 
                       too_few = "align_start")

# Print number of rows and columns with dim_desc()
dim_desc(lipid_wide) 
[1] "[1 x 110]"

7.3.3 Separating the Web of Science C1 column

As described in Section 7.2.3, C1 is the field tag for Author address which also contains institution department information, but as with the Scopus data this column contains multiple values in each cell.

To make matters worse, the values for the departments aren’t separated using a simple delimiter.

Let’s look C1 in lipid_wide:

lipid_wide |> 
  select(C1) |> 
  flextable() |> 
  autofit()

C1

[Bilson, Josh; Scorletti, Eleonora] Univ Southampton, Fac Med, Sch Human Dev & Hlth, Southampton, England; [Bilson, Josh; Scorletti, Eleonora] Univ Southampton, Natl Inst Hlth Res Southampton, Biomed Res Ctr, Southampton, England; [Bilson, Josh; Scorletti, Eleonora] Univ Hosp Southampton Natl Hlth Serv Fdn Trust, Southampton, England; [Scorletti, Eleonora] Univ Penn, Perelman Sch Med, Dept Genet, Philadelphia, PA 19104 USA

What we’ve got are ; between the names in the square brackets and after each country.

My solution here is to break the problem down into steps:

  1. Separate as before using the ; delimiter. This will yield columns that have names or addresses, each column name starts C1_.

  2. Pivot the table to be longer such that the names of the new columns C1_1, C1_2 etc. from step 1 become rows in default variable name and values populate default variable value.

  3. Next, extract the string with the University of Southampton information from each row to a new variable called department. Not every row has University of Southampton information and this will create empty values.

  4. Filter out all the rows with empty department values.

  5. Pivot the table wider again creating columns with names from the name variable populated with values from the department variable.

This should mean we have a table with just the University of Southampton affiliations from Web of Science in columns prefixed C1_

Simple, huh?! It’s much more concise in code.

7.3.3.1 Step 1: Separate C1

As before, we separate using separate_wider_delim() using ; as the delimiter.

This time we end up with 7 columns as there are ; between author names and country.

lipid_wide_2 <- lipid_wide |> 
  separate_wider_delim(cols = C1, 
                       delim = ";", 
                       names_sep = "_", 
                       too_few = "align_start")

# Check the output
lipid_wide_2 |> 
  select(starts_with("C1")) |> 
  glimpse()
Rows: 1
Columns: 7
$ C1_1 <chr> "[Bilson, Josh"
$ C1_2 <chr> " Scorletti, Eleonora] Univ Southampton, Fac Med, Sch Human Dev &…
$ C1_3 <chr> " [Bilson, Josh"
$ C1_4 <chr> " Scorletti, Eleonora] Univ Southampton, Natl Inst Hlth Res South…
$ C1_5 <chr> " [Bilson, Josh"
$ C1_6 <chr> " Scorletti, Eleonora] Univ Hosp Southampton Natl Hlth Serv Fdn T…
$ C1_7 <chr> " [Scorletti, Eleonora] Univ Penn, Perelman Sch Med, Dept Genet, …

7.3.3.2 Steps 2: Pivot longer

Next we pivot_longer() and take advantage of starts_with() using character vector "C1_" as the argument to pivot on all the columns from step 1. The output is assigned to lipid_wide_3

lipid_wide_3 <- lipid_wide_2 |> 
  pivot_longer(cols = starts_with("C1"))

7.3.3.3 Steps 3 and 4 : Extract the University of Southampton department and filter

Then we use mutate() to create a new variable department based on str_extract() on the value variable with a regex passed as a string.

I created the regex pattern using Generative AI giving it the string [Bilson, Josh; Scorletti, Eleonora] Univ Southampton, Fac Med, Sch Human Dev & Hlth, Southampton, England; and a prompt:

"write a regex to capture the text after ']' and before ', Southampton, England'"

I pipe this to filter with argument !is.na() that means “if NOT an empty value” for department to filter out rows with value NA for department.

I pipe this to select and use negative selection to drop the value column created in the pivot longer as we no longer need it.

The output is assigned to lipid_wide_4.

lipid_wide_4 <- lipid_wide_3 |> 
  mutate(department = str_extract(value,"(?<=(\\]))(.+?)(?=, Southampton, England)")) |> 
  filter(!is.na(department)) |> 
  select(-value)

7.3.4 Step 5 : Pivot wider

Finally, we pivot wider taking the names from the name column and filling the columns with the values from department. The output is saved in lipid_wide_5.

This returns the lipid_wide table, but with the University of Southampton departments originally in C1 in column now in columns prefixed C1_

lipid_wide_5 <- lipid_wide_4 |> 
  pivot_wider(names_from = name, values_from = department)

# 
dim_desc(lipid_wide_5)
[1] "[1 x 112]"

Let’s look at

lipid_wide_5 |> 
  select(1, 6, starts_with("C1_"), starts_with("Affiliations_")) |> 
  flextable() |> 
  font(part = "all", fontname = "Consolas") |>
  bg(bg = "white", part = "all") |> 
  autofit()

Electronic version(s) of this work > DOI (Digital Object Identifier)-0

Title of the contribution in original language-5

C1_2

C1_4

C1_6

Affiliations_1

Affiliations_2

Affiliations_3

10.1097/MCO.0000000000000993

Lipid droplets in steatotic liver disease

Univ Southampton, Fac Med, Sch Human Dev & Hlth

Univ Southampton, Natl Inst Hlth Res Southampton, Biomed Res Ctr

Univ Hosp Southampton Natl Hlth Serv Fdn Trust

School of Human Development and Health, Faculty of Medicine, University of Southampton, United Kingdom

National Institute for Health Research, Southampton Biomedical Research Centre, University of Southampton, University Hospital Southampton, National Health Service Foundation Trust, Southampton, United Kingdom

Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States

7.4 Whole game

Let’s do that again, but for the whole table and in fewer parts

7.4.1 Join Scopus and Wos tables to the Pure table

pure_join <- pure_dat |> 
  left_join(scopus_dat, by = c(`Title of the contribution in original language-5` = "Title",
                               `Electronic version(s) of this work > DOI (Digital Object Identifier)-0` = "DOI")) |> 
  left_join(wos_dat, by = c(`Title of the contribution in original language-5` = "TI" , 
                           `Electronic version(s) of this work > DOI (Digital Object Identifier)-0` = "DI"))

# Print number of rows and columns with dim_desc()
dim_desc(pure_join)
[1] "[5,187 x 108]"

7.4.2 Separate the Scopus Affiliations

pure_wide <- pure_join |> 
  separate_wider_delim(cols = Affiliations, 
                       delim = ";", 
                       names_sep = "_", 
                       too_few = "align_start")

# Print number of rows and columns with dim_desc()
dim_desc(pure_wide) 
[1] "[5,187 x 185]"

As an extra step, I’m going to use the same pivot trick to get rid of any Scopus affiliations that aren’t University of Southampton.

7.4.3 Separate the WoS Affiliations

pure_wider <- pure_wide |> 
  separate_wider_delim(cols = C1, 
                       delim = ";", 
                       names_sep = "_", 
                       too_few = "align_start")


pure_long <- pure_wider |> 
  pivot_longer(cols = starts_with("C1_")) |> 
  mutate(department = str_extract(value,"(?<=(\\]))(.+?)(?=, Southampton, England)")) |> 
  select(-value)

pure_table <- pure_long |> 
  pivot_wider(names_from = name, values_from = department)


dim_desc(pure_table)
[1] "[5,187 x 193]"

Deal with the missing values and keep just the rows with Southampton affiliations.

pure_table_clean <- pure_table |> 
  pivot_longer(c(starts_with("Affiliations_"),starts_with("C1_")),
               values_to = "department", 
               values_drop_na = TRUE) |> 
  mutate(department = str_trim(department))

Do a grouped summary using only Southampton

pure_table_clean |> 
  filter(str_detect(department,"Southampton")) |> 
  rename(title = `Title of the contribution in original language-5`) |> 
  group_by(title, `Open Access status-13`) |> 
  reframe(department) |> 
  slice(1:10)
# A tibble: 10 × 3
   title                                       Open Access status-1…¹ department
   <chr>                                       <chr>                  <chr>     
 1 A machine-learning exploration of the expo… Indeterminate          School of…
 2 A machine-learning exploration of the expo… Indeterminate          NIHR Sout…
 3 A machine-learning exploration of the expo… Indeterminate          MRC Lifec…
 4 A range of replies                          Open                   Departmen…
 5 A scalable hybrid analog-digital architect… Embargoed              Institute…
 6 A track-independent vehicle indicator for … Open                   ISVR, Uni…
 7 AIDS impact special issue 2023: two-way as… Open                   CHERISH p…
 8 AIDS impact special issue 2023: two-way as… Open                   Departmen…
 9 Aligning academic career management in the… Embargoed              Southampt…
10 Alternative sources of bioactive omega-3 f… Closed                 Univ Hosp…
# ℹ abbreviated name: ¹​`Open Access status-13`