R approach for histogram

My paper about regular expressions for data reshaping was recently accepted into R journal. I used visualization of the iris data as a example to motivate reshaping. Recall the iris data look like this,

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

To make a facetted histogram, we need to combine the first four columns into a single column. To do that we define a regular expression to match those column names, then use that to reshape and visualize.

(iris.long <- nc::capture_melt_single(
  iris, part=".*", "[.]", dim=".*", value.name="cm"))
##        Species  part    dim  cm
##   1:    setosa Sepal Length 5.1
##   2:    setosa Sepal Length 4.9
##   3:    setosa Sepal Length 4.7
##   4:    setosa Sepal Length 4.6
##   5:    setosa Sepal Length 5.0
##  ---                           
## 596: virginica Petal  Width 2.3
## 597: virginica Petal  Width 1.9
## 598: virginica Petal  Width 2.0
## 599: virginica Petal  Width 2.3
## 600: virginica Petal  Width 1.8
library(ggplot2)
## Keep up to date with changes at https://www.tidyverse.org/blog/
ggplot()+
  geom_histogram(aes(
    cm, fill=Species),
    color="black",
    bins=20,
    data=iris.long)+
  facet_grid(part ~ dim, labeller=label_both)

plot of chunk iris.long

Recently data.table merged my PR which implements similar functionality:

library(data.table)
## data.table 1.14.1 IN DEVELOPMENT built 2021-05-25 12:11:15 UTC; tdhock using 1 threads (see ?getDTthreads).  Latest news: r-datatable.com
one.iris <- data.table(iris[1,])
nc::capture_melt_single(one.iris, part=".*", "[.]", dim=".*")
##    Species  part    dim value
## 1:  setosa Sepal Length   5.1
## 2:  setosa Sepal  Width   3.5
## 3:  setosa Petal Length   1.4
## 4:  setosa Petal  Width   0.2
data.table::melt(
  one.iris, measure.vars=measure(part, dim, pattern="(.*)[.](.*)"))
##    Species  part    dim value
## 1:  setosa Sepal Length   5.1
## 2:  setosa Sepal  Width   3.5
## 3:  setosa Petal Length   1.4
## 4:  setosa Petal  Width   0.2

This functionality is also implemented in the tidyr package:

pattern <- "(.*)[.](.*)"
tidyr::pivot_longer(
  one.iris,
  cols=matches(pattern),
  names_to=c("part", "dim"),
  names_pattern=pattern)
## # A tibble: 4 x 4
##   Species part  dim    value
##   <fct>   <chr> <chr>  <dbl>
## 1 setosa  Sepal Length   5.1
## 2 setosa  Sepal Width    3.5
## 3 setosa  Petal Length   1.4
## 4 setosa  Petal Width    0.2

Note that in the tidyr version above the pattern needs to be repeated in the cols and the names_pattern arguments.

Comparison with python approaches for histogram

I was curious to investigate the state of python modules for performing the same computations. Is it as easy as in R? The intent of the R code above is to use a regex pattern to specify: (1) the set of columns to melt/reshape/unpivot, and (2) the data to capture in the dim and part columns of the output. So my code below tries to translate that idea into python.

If you are limited to using pandas, this is possible but not easy. My first attempt at the data reshaping involved melt,

import pandas as pd
one_iris=pd.DataFrame({
    "Sepal.Length":[5.1],
    "Sepal.Width":[3.5],
    "Petal.Length":[1.4],
    "Petal.Width":[0.2],
    "Species":"setosa"})
pattern = "(?P<part>.*)[.](?P<dim>.*)"
col_extract = one_iris.columns.to_series().str.extract(pattern)
id_vars = [
    name for name,matched in
    zip(one_iris.columns, col_extract.iloc[:,0].notna())
    if not matched]
one_long = one_iris.melt(id_vars=id_vars)
pd.concat([one_long["variable"].str.extract(pattern), one_long], axis=1)
##     part     dim Species      variable  value
## 0  Sepal  Length  setosa  Sepal.Length    5.1
## 1  Sepal   Width  setosa   Sepal.Width    3.5
## 2  Petal  Length  setosa  Petal.Length    1.4
## 3  Petal   Width  setosa   Petal.Width    0.2

That was complicated! Luckily there is another function,

import janitor
import re
names_to = list(re.compile(pattern).groupindex.keys())
one_iris.pivot_longer(
    index="Species",
    names_pattern=pattern,
    names_to=names_to)
##   Species   part     dim  value
## 0  setosa  Sepal  Length    5.1
## 1  setosa  Sepal   Width    3.5
## 2  setosa  Petal  Length    1.4
## 3  setosa  Petal   Width    0.2

That was much easier! Note that the names and sub-patterns for each capture group were defined together in the same regular expression string literal, which was then compiled to get the group names to pass as the names_to argument. Note that the python iris data has slightly different column names (no dot),

iris_url = "https://raw.github.com/pandas-dev/pandas/master/pandas/tests/io/data/csv/iris.csv"
iris_wide = pd.read_csv(iris_url)
iris_wide
##      SepalLength  SepalWidth  PetalLength  PetalWidth            Name
## 0            5.1         3.5          1.4         0.2     Iris-setosa
## 1            4.9         3.0          1.4         0.2     Iris-setosa
## 2            4.7         3.2          1.3         0.2     Iris-setosa
## 3            4.6         3.1          1.5         0.2     Iris-setosa
## 4            5.0         3.6          1.4         0.2     Iris-setosa
## ..           ...         ...          ...         ...             ...
## 145          6.7         3.0          5.2         2.3  Iris-virginica
## 146          6.3         2.5          5.0         1.9  Iris-virginica
## 147          6.5         3.0          5.2         2.0  Iris-virginica
## 148          6.2         3.4          5.4         2.3  Iris-virginica
## 149          5.9         3.0          5.1         1.8  Iris-virginica
## 
## [150 rows x 5 columns]

So we must repeat those steps with a different pattern,

pattern = "(?P<part>Sepal|Petal)(?P<dim>Length|Width)"
names_to = list(re.compile(pattern).groupindex.keys())
iris_long = iris_wide.pivot_longer(
    index="Name",
    names_pattern=pattern,
    names_to=names_to,
    values_to="cm")
iris_long
##                Name   part     dim   cm
## 0       Iris-setosa  Sepal  Length  5.1
## 1       Iris-setosa  Sepal  Length  4.9
## 2       Iris-setosa  Sepal  Length  4.7
## 3       Iris-setosa  Sepal  Length  4.6
## 4       Iris-setosa  Sepal  Length  5.0
## ..              ...    ...     ...  ...
## 595  Iris-virginica  Petal   Width  2.3
## 596  Iris-virginica  Petal   Width  1.9
## 597  Iris-virginica  Petal   Width  2.0
## 598  Iris-virginica  Petal   Width  2.3
## 599  Iris-virginica  Petal   Width  1.8
## 
## [600 rows x 4 columns]

After that there are a number of modules to consider for making the histogram. Again my goal is to replicate the intent of the R ggplot2 code, which says “each Species is a different color, and each part/dim is a different panel.” At first I thought it may be possible to use the default pandas plotting methods, but it is clear from the docs that there is no easy way to specify a variable to use for plotting in different panels (a la facet_grid). Same for matplotlib and bokeh. A python alternative is altair, which does implement facets:

import altair as alt
chart = alt.Chart(iris_long).mark_bar().encode(
    alt.X("cm:Q", bin=True),
    y='count()',
    color="Name"
).facet(row="part", column="dim")
# need to do chart.show() then click save as PNG.

plot of iris histogram

R approach for scatterplot comparing parts

Next plot from the paper was a facetted scatterplot comparing parts… are Petals longer and/or wider than Sepals? To do that we need to change the reshape operation so we output a Sepal and a Petal column. In R this amounts to changing one of the regex group names to a special value that is recognized as the keyword for creating multiple output columns.

Package nc data.table tidyr
Function capture_melt_multiple melt + measure pivot_longer
Keyword column value.name .value
nc::capture_melt_multiple(one.iris, column=".*", "[.]", dim=".*")
##    Species    dim Petal Sepal
## 1:  setosa Length   1.4   5.1
## 2:  setosa  Width   0.2   3.5
data.table::melt(
  one.iris, measure.vars=measure(value.name, dim, pattern="(.*)[.](.*)"))
##    Species    dim Sepal Petal
## 1:  setosa Length   5.1   1.4
## 2:  setosa  Width   3.5   0.2
pattern <- "(.*)[.](.*)"
tidyr::pivot_longer(
  one.iris,
  cols=matches(pattern),
  names_to=c(".value", "part"),
  names_pattern=pattern)
## # A tibble: 2 x 4
##   Species part   Sepal Petal
##   <fct>   <chr>  <dbl> <dbl>
## 1 setosa  Length   5.1   1.4
## 2 setosa  Width    3.5   0.2

Doing that for the entire data set then plotting yields

(iris.parts <- nc::capture_melt_multiple(iris, column=".*", "[.]", dim=".*"))
##        Species    dim Petal Sepal
##   1:    setosa Length   1.4   5.1
##   2:    setosa Length   1.4   4.9
##   3:    setosa Length   1.3   4.7
##   4:    setosa Length   1.5   4.6
##   5:    setosa Length   1.4   5.0
##  ---                             
## 296: virginica  Width   2.3   3.0
## 297: virginica  Width   1.9   2.5
## 298: virginica  Width   2.0   3.0
## 299: virginica  Width   2.3   3.4
## 300: virginica  Width   1.8   3.0
ggplot()+
  coord_equal()+
  geom_abline(slope=1, intercept=0, color="grey")+
  geom_point(aes(
    Sepal, Petal, color=Species),
    data=iris.parts)+
  facet_grid(. ~ dim, labeller=label_both)

plot of chunk unnamed-chunk-7

Python approach for scatterplot comparing parts

Python janitor can do the reshape (same .value keyword as in R to indicate multiple outputs), but we can’t use a regex with named groups this time, because the dot is not allowed:

re.compile("(?P<.value>Sepal|Petal)(?P<dim>Length|Width)")
## Error in py_call_impl(callable, dots$args, dots$keywords): error: bad character in group name '.value' at position 4
## 
## Detailed traceback:
##   File "<string>", line 1, in <module>
##   File "/home/tdhock/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/re.py", line 233, in compile
##     return _compile(pattern, flags)
##   File "/home/tdhock/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/re.py", line 301, in _compile
##     p = sre_compile.compile(pattern, flags)
##   File "/home/tdhock/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/sre_compile.py", line 562, in compile
##     p = sre_parse.parse(p, flags)
##   File "/home/tdhock/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/sre_parse.py", line 855, in parse
##     p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
##   File "/home/tdhock/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/sre_parse.py", line 416, in _parse_sub
##     not nested and not items))
##   File "/home/tdhock/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/sre_parse.py", line 647, in _parse
##     raise source.error(msg, len(name) + 1)

Well I guess you could use do the following,

names_pattern = "(?P<_value>Sepal|Petal)(?P<dim>Length|Width)"
names_to = [x.replace("_", ".") for x in re.compile(names_pattern).groupindex.keys()]
names_to
## ['.value', 'dim']
iris_parts = iris_wide.pivot_longer(
    index="Name",
    names_pattern=pattern,
    names_to=names_to)
iris_parts
##                Name     dim  Sepal  Petal
## 0       Iris-setosa  Length    5.1    1.4
## 1       Iris-setosa  Length    4.9    1.4
## 2       Iris-setosa  Length    4.7    1.3
## 3       Iris-setosa  Length    4.6    1.5
## 4       Iris-setosa  Length    5.0    1.4
## ..              ...     ...    ...    ...
## 295  Iris-virginica   Width    3.0    2.3
## 296  Iris-virginica   Width    2.5    1.9
## 297  Iris-virginica   Width    3.0    2.0
## 298  Iris-virginica   Width    3.4    2.3
## 299  Iris-virginica   Width    3.0    1.8
## 
## [300 rows x 4 columns]

Now let’s try to plot with altair again,

import altair as alt
chart = alt.Chart(iris_parts).mark_circle().encode(
    x="Sepal",
    y="Petal",
    color="Name"
).facet(column="dim")

scatterplot of iris

Two ggplot2 features which are missing in the altair plot above are coord_equal and geom_abline. These are useful when you want to emphasize that the x and y axes have the same units, so you can see if the data are above the diagonal, are Petals longer/wider than Sepals?

So did you read my blog tutorial from last year, in which we explored python datatable and plotnine? Well, datatable still has not implemented the reshape/melt functionality, and plotnine still seems to have the best ggplot2 emulation:

import plotnine as p9
gg_scatter_parts = p9.ggplot()+\
    p9.geom_abline(
        slope=1, intercept=0,
        color="grey")+\
    p9.geom_point(
        p9.aes(x="Sepal", y="Petal", fill="Name"),
        iris_parts)+\
    p9.facet_grid(". ~ dim", labeller="label_both")+\
    p9.coord_equal()

plot of scatter_parts

So janitor + plotnine works well here! Incidentally, we can also do this reshape with plain pandas:

iris_wide["id"] = iris_wide.index
pd.wide_to_long(
    iris_wide, ["Petal", "Sepal"],
    i="id", j="dim", sep="", suffix="(Width|Length)"
).reset_index()
##       id     dim            Name  Petal  Sepal
## 0      0  Length     Iris-setosa    1.4    5.1
## 1      1  Length     Iris-setosa    1.4    4.9
## 2      2  Length     Iris-setosa    1.3    4.7
## 3      3  Length     Iris-setosa    1.5    4.6
## 4      4  Length     Iris-setosa    1.4    5.0
## ..   ...     ...             ...    ...    ...
## 295  145   Width  Iris-virginica    2.3    3.0
## 296  146   Width  Iris-virginica    1.9    2.5
## 297  147   Width  Iris-virginica    2.0    3.0
## 298  148   Width  Iris-virginica    2.3    3.4
## 299  149   Width  Iris-virginica    1.8    3.0
## 
## [300 rows x 5 columns]

So pd.wide_to_long in python is similar to stats::reshape in R, in that (1) two groups are assumed, and (2) it is assumed the .value / value.name / column group comes first. In other words, these work for this reshape operation, but they do NOT work for the similar reshape in the next section (without some pre-processing of column names).

R approach for scatterplot comparing dims

In this section we reshape into Length and Width columns. In R we just have to move the keyword from the first to the second group,

nc::capture_melt_multiple(one.iris, part=".*", "[.]", column=".*")
##    Species  part Length Width
## 1:  setosa Petal    1.4   0.2
## 2:  setosa Sepal    5.1   3.5
data.table::melt(
  one.iris, measure.vars=measure(part, value.name, pattern="(.*)[.](.*)"))
##    Species  part Length Width
## 1:  setosa Sepal    5.1   3.5
## 2:  setosa Petal    1.4   0.2
pattern <- "(.*)[.](.*)"
tidyr::pivot_longer(
  one.iris,
  cols=matches(pattern),
  names_to=c("part", ".value"),
  names_pattern=pattern)
## # A tibble: 2 x 4
##   Species part  Length Width
##   <fct>   <chr>  <dbl> <dbl>
## 1 setosa  Sepal    5.1   3.5
## 2 setosa  Petal    1.4   0.2

Doing that for the entire data set then plotting yields

(iris.dims <- nc::capture_melt_multiple(iris, part=".*", "[.]", column=".*"))
##        Species  part Length Width
##   1:    setosa Petal    1.4   0.2
##   2:    setosa Petal    1.4   0.2
##   3:    setosa Petal    1.3   0.2
##   4:    setosa Petal    1.5   0.2
##   5:    setosa Petal    1.4   0.2
##  ---                             
## 296: virginica Sepal    6.7   3.0
## 297: virginica Sepal    6.3   2.5
## 298: virginica Sepal    6.5   3.0
## 299: virginica Sepal    6.2   3.4
## 300: virginica Sepal    5.9   3.0
ggplot()+
  coord_equal()+
  geom_abline(slope=1, intercept=0, color="grey")+
  geom_point(aes(
    Length, Width, color=Species),
    data=iris.dims)+
  facet_grid(. ~ part, labeller=label_both)

plot of chunk unnamed-chunk-13

Python approach for scatterplot comparing dims

As in the previous problem, we can use janitor + plotnine:

names_pattern = "(?P<part>Sepal|Petal)(?P<_value>Length|Width)"
names_to = [x.replace("_", ".") for x in re.compile(names_pattern).groupindex.keys()]
names_to
## ['part', '.value']
iris_dims = iris_wide.pivot_longer(
    index="Name",
    names_pattern=pattern,
    names_to=names_to)
## /home/tdhock/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/janitor/utils.py:1500: FutureWarning: This dataframe has a column name that matches the 'value_name' column name of the resultiing Dataframe. In the future this will raise an error, please set the 'value_name' parameter of DataFrame.melt to a unique name.
iris_dims
##                Name   part  Length  Width   0
## 0       Iris-setosa  Petal     1.4    0.2 NaN
## 1       Iris-setosa  Petal     1.4    0.2 NaN
## 2       Iris-setosa  Petal     1.3    0.2 NaN
## 3       Iris-setosa  Petal     1.5    0.2 NaN
## 4       Iris-setosa  Petal     1.4    0.2 NaN
## ..              ...    ...     ...    ...  ..
## 295  Iris-virginica  Sepal     6.7    3.0 NaN
## 296  Iris-virginica  Sepal     6.3    2.5 NaN
## 297  Iris-virginica  Sepal     6.5    3.0 NaN
## 298  Iris-virginica  Sepal     6.2    3.4 NaN
## 299  Iris-virginica  Sepal     5.9    3.0 NaN
## 
## [300 rows x 5 columns]
gg_scatter_dims = p9.ggplot()+\
    p9.geom_abline(
        slope=1, intercept=0,
        color="grey")+\
    p9.geom_point(
        p9.aes(x="Length", y="Width", fill="Name"),
        iris_dims)+\
    p9.facet_grid(". ~ part", labeller="label_both")+\
    p9.coord_equal()

plot of scatter_dims

No integrated type conversion in python

In R we can do type conversion during the reshape,

DT <- data.table(id=1, child1_sex="M", child1_age=34, child2_sex="F")
pattern <- "(.)_(.*)"
names_transform <- list(number_int=as.integer, .value=identity)
tidyr::pivot_longer(
  DT, matches(pattern),
  names_to=names(names_transform),
  names_transform=names_transform,
  names_pattern=pattern)
## # A tibble: 2 x 4
##      id number_int sex     age
##   <dbl>      <int> <chr> <dbl>
## 1     1          1 M        34
## 2     1          2 F        NA
print(
  melt(DT, measure.vars=measurev(
    names_transform, pattern=pattern, multiple.keyword=".value"))
, class=TRUE)
##       id number_int    sex   age
##    <num>      <int> <char> <num>
## 1:     1          1      M    34
## 2:     1          2      F    NA
number_pattern <- list(number_int=".", as.integer)
print(
  nc::capture_melt_multiple(
    DT, number_pattern, "_", column=".*", fill=TRUE)
, class=TRUE)
##       id number_int   age    sex
##    <num>      <int> <num> <char>
## 1:     1          1    34      M
## 2:     1          2    NA      F

Note that tidyr and data.table syntax require definition of the regex pattern in a separately from the names/conversions which are defined in names_transform. In contrast, the nc syntax allows definition of these three related pieces of information together in group-specific sub-pattern list variables, for example number_pattern. The results above show that number_int is indeed of type int — is this possible in python?

DT = pd.DataFrame({
    "id":[1],
    "child1_sex":["M"],
    "child1_age":[34],
    "child2_sex":["F"]
})
DT_long = DT.pivot_longer(
    index="id",
    names_pattern="(.)_(.*)",
    names_to = ["number", ".value"])
DT_long
##    id number sex   age
## 0   1      1   M  34.0
## 1   1      2   F   NaN
type(DT_long["number"][0])
## <class 'str'>

There is no names_transform argument in python (number is actually a string), but you can always do the transform after the fact (less efficient/convenient),

DT_long["number_int"] = DT_long["number"].astype(int)
DT_long
##    id number sex   age  number_int
## 0   1      1   M  34.0           1
## 1   1      2   F   NaN           2
type(DT_long["number_int"][0])
## <class 'numpy.int64'>

Conclusions

So in conclusion the python pivot_longer is a decent data reshaping tool, but still is not quite as fully featured as the software we have in R. In particular we observed the following:

  • Named capture groups are supported in python, and are useful since they allow defining the capture group names and sub-patterns together in the regex string literal. This is an advantage over R tidyr, which does not support named groups because it uses the ICU C regex library (does not export of group names to R). Even better (in terms of keeping related information together) is the R nc syntax, which allows defining a list for each capture group that contains: (1) group name, (2) regex pattern, and (3) type conversion function.
  • Named capture groups can not be easily used with python pivot_longer for reshape to multiple output columns. This is because pivot_longer outputs multiple value columns when names_to=".value" is specified, but .value can not be used as a capture group name in the regex. We showed a workaround using the regex capture group named _value and renamed to .value.
  • R packages provide integrated type conversion, but python pivot_longer does not (although types can be converted after the reshape operation).

Of the python modules for visualization,

  • neither bokeh, nor the pandas plot method, nor matplotlib, supports facets (subplots defined via column names).
  • altair supports facets but does not support coord_equal and geom_abline.
  • plotnine seems to be the best way in python to emulate the functionality we have in R with ggplot2.