Humanités Numériques: Validation

Alexey Koshevoy

11/19/2020

1. Plots

To categorize each collector, I transformed the data about their collection into a binary vector of dummy variables. That is, each collector is represented as a binary vector of length 1007, each element encoding the presence or absence of a particular type of object (Céramique, Curiosités, or any other category from the database). The same operation can be done for each city as well. The notebook with all of the pre-processing can be found here.
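For reference, the encoding could be reproduced in R along these lines (a minimal sketch, assuming data_m has one row per collector-object pair; the actual pre-processing lives in the notebook linked above):

library(dplyr)
library(tidyr)

# Sketch: one row per collector, one 0/1 column per object type.
dummy_mat <- data_m %>%
  distinct(Nom, Objets.collectionnés) %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from = Objets.collectionnés,
              values_from = present,
              values_fill = 0L)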

1.1. Distributions, long tails, etc.

Let’s look at the city distribution first:

data_m %>% 
  group_by(Ville) %>%
  summarize(Count=n()) %>%
  arrange(Count) %>%
  filter(Count > 20) %>%
  ggplot(aes(y=reorder(Ville, Count), x=Count))+
  geom_bar(stat='identity', aes(fill=Ville))+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  scale_color_viridis()+
  theme(axis.title.y=element_blank(), 
        legend.position = "none")+
  labs(x='Number of object types')

As can be seen from the plot above, small cities tend to have fewer than 20 types of objects, so let’s stick with the first 13.

Now let’s look at all the cities in the database. However, instead of the city’s name, I will plot each city’s rank relative to the number of item types coming from it:

data_m %>% 
  group_by(Ville) %>%
  summarize(Count=n()) %>%
  arrange(desc(Count)) %>%
  mutate(Rank = row_number()) %>%
  ggplot()+
  geom_point(aes(x=Rank, y=Count))+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  scale_color_viridis()+
  theme(axis.title.y=element_blank(), 
        legend.position = "none")+
  annotate(
    geom = "curve", x = 99, y = 452, xend = 4, yend = 532, 
    curvature = .3, arrow = arrow(length = unit(2, "mm"))
  ) +
  annotate(geom = "text", x = 101, y = 455, label = "Paris", hjust = "left")

It seems like the number of items is distributed according to some power law. Let’s try Zipf’s law:
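Under Zipf’s law with exponent alpha, the count expected at rank r is Count(1) / r^alpha, which is what the zipfs_freq column below computes; the green line in the plot shows this prediction for alpha = 1.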

alpha = 1

data_m %>% 
  group_by(Ville) %>%
  summarize(Count=n()) %>%
  arrange(desc(Count)) %>%
  mutate(Rank = row_number(), 
         zipfs_freq = ifelse(Rank == 1, Count, dplyr::first(Count) / Rank^alpha)) %>%
  ggplot()+
  geom_point(aes(x=log(Rank), y=log(Count)), alpha=0.7)+
  geom_line(aes(x=log(Rank), y=log(zipfs_freq)), color='#26A96C')+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  labs(x='Log Rank', y='Log N of Items')

Indeed, the number of items in each city follows Zipf’s law quite closely. For this particular study, it means that there is a long tail of cities with similarly small numbers of items, which could potentially be ignored.

Needless to say, the exact same thing happens with the object types:

data_m %>% 
  group_by(Objets.collectionnés) %>%
  summarize(Count=n()) %>%
  arrange(desc(Count)) %>%
  mutate(Rank = row_number()) %>%
  ggplot()+
  geom_point(aes(x=Rank, y=Count))+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  labs(x='Type Rank')

Once again, there are some 10-15 very frequent types and then a long tail of infrequent ones. Does it follow Zipf’s law?

alpha = 1

data_m %>% 
  group_by(Objets.collectionnés) %>%
  summarize(Count=n()) %>%
  arrange(desc(Count)) %>%
  mutate(Rank = row_number(), 
         zipfs_freq = ifelse(Rank == 1, Count, dplyr::first(Count) / Rank^alpha)) %>%
  ggplot()+
  geom_point(aes(x=log(Rank), y=log(Count)), alpha=0.7)+
  geom_line(aes(x=log(Rank), y=log(zipfs_freq)), color='#26A96C')+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  labs(x='Log Rank of Type', y='Log N of Items')

Not quite, but the notion of a long tail still holds.

Therefore, let’s limit ourselves to the 13 cities with the greatest variability in object types and to the 14 most frequent types:

| X | Département | Ville | Nom | Prénom | Profession | N..rue | Rue | Objets.collectionnés | Notes..état.de.la.collection. | Type | City_subset |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ain | Bourg | du Péloux | M. | | | | Céramique | | Céramique | Other |
| 1 | Ain | Bourg | du Péloux | M. | | | | Livres | | Livres | Other |
| 2 | Ain | Montréal | Comte de Douglas | | | | Château de Montréal | Ex-libris | 4000 environ | Other | Other |
| 3 | Ain | Montréal | Comte de Douglas | | | | Château de Montréal | Blasons | | Other | Other |
| 4 | Ain | Montréal | Comte de Douglas | | | | Château de Montréal | Livres de blasons | | Other | Other |
| 5 | Ain | Montréal | Comte de Douglas | | | | Château de Montréal | Listels ou blasons employés par les Pompes funèbres | 2000 environs | Other | Other |

As you can see from the table above, I modified the dataframe in order to:

1. collapse all but the 14 most frequent object types into an ‘Other’ level of a new Type column;
2. collapse all but the 13 biggest cities into an ‘Other’ level of a new City_subset column (a sketch of how this could be done is shown below).
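This collapsing step could be sketched in R as follows (hypothetical code, assuming the data_m columns shown above; the actual modification was done during pre-processing):

# Collapse infrequent types and small cities into an 'Other' level.
top_types  <- data_m %>% count(Objets.collectionnés, sort = TRUE) %>%
  slice(1:14) %>% pull(Objets.collectionnés)
top_cities <- data_m %>% count(Ville, sort = TRUE) %>%
  slice(1:13) %>% pull(Ville)

data_m <- data_m %>%
  mutate(Type = ifelse(Objets.collectionnés %in% top_types,
                       as.character(Objets.collectionnés), 'Other'),
         City_subset = ifelse(Ville %in% top_cities,
                              as.character(Ville), 'Other'))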

1.2. Different types of objects across the biggest cities

Let’s look at the cities with the most items in the dataset:

data_m[data_m$Type != 'Other',] %>%
  group_by(City_subset, Type) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  ggplot(aes(x=reorder(Type, freq), y=freq))+
  geom_bar(aes(color=Type, fill=Type), stat='identity', position="dodge")+
  scale_y_continuous(breaks=c(0.01, 0.2, 0.4), labels=c('1%', '20%', '40%'))+
  coord_flip()+
  facet_wrap(~City_subset)+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  theme(axis.title.y=element_blank(), 
        legend.position = "none")+
  labs(y='Percentage')

In all of the cities above, the most popular items are usually paintings (Tableaux) and ceramics (Céramique). The same overall pattern holds across the ‘long tail’ of cities (see the ‘Other’ panel).

Let’s map the cities by their numbers of ceramics and paintings, as these two types are the most frequent ones:

coordss <- merge(x = data_m, y = coords, by.x = "Ville", by.y='city')

coordss[coordss$Objets.collectionnés %in% c('Céramique', 'Tableaux'),] %>%
  group_by(Ville, Objets.collectionnés) %>%
  summarize(Count=n(), Lat=max(Lat), Long=max(Long)) %>%
  arrange(desc(Count)) %>%
  filter(Count > 3) %>%
  ggplot(aes(x=Lat, y=Long))+
  theme_void()+
  coord_map()+
  geom_polygon(data=fr, aes(x = long, y = lat, group=group), fill='#F1DEDC')+
  geom_point(aes(size=Count), color='red')+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  geom_text_repel(aes(label=Ville))

And which cities have the most paintings?

coordss <- merge(x = data_m, y = coords, by.x = "Ville", by.y='city')

coordss[coordss$Objets.collectionnés == 'Tableaux',] %>%
  group_by(Ville) %>%
  summarize(Count=n(), Lat=max(Lat), Long=max(Long)) %>%
  arrange(desc(Count)) %>%
  filter(Count > 3) %>%
  ggplot(aes(x=Lat, y=Long))+
  coord_map()+
  theme_void()+
  geom_polygon(data=fr, aes(x = long, y = lat, group=group), fill='#F1DEDC')+
  geom_point(aes(size=Count), color='red')+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  geom_text_repel(aes(label=Ville))

2. Graphs and polygons

The dummy vectors that we built for each city can be used to compute pairwise distance matrices between cities or between collectors. See the notebook for the pre-processing procedure.
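For instance, a collector-level matrix could be derived from the 0/1 vectors with the binary (Jaccard) distance (a sketch under that assumption, reusing the hypothetical dummy_mat from Section 1; the distance actually used is defined in the notebook):

m <- as.matrix(dummy_mat[, -1])   # drop the Nom column
rownames(m) <- dummy_mat$Nom
# method = 'binary' gives the Jaccard distance between 0/1 type vectors
dist_m <- as.matrix(dist(m, method = "binary"))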

2.1. Useful functions

First of all, let’s define a function that builds a graph from a pairwise distance matrix:

build_graph <- function(distance){
  distance <- data.frame(distance)
  # replace infinite distances with NA, then with 0 (i.e. no edge)
  is.na(distance) <- sapply(distance, is.infinite)
  distance[is.na(distance)] <- 0
  # undirected weighted graph from the pairwise distance matrix
  g <- graph.adjacency(data.matrix(distance), mode = "undirected", diag = FALSE, weighted = T) 
  return(g)
}

Then I will define a function that takes the graph produced by the previous function and computes the desired graph-based metrics on it: vertex strength (i.e. weighted degree centrality), closeness, and betweenness.

calculate_g_stats <- function(g){
  # strength = weighted degree, i.e. the sum of the weights of incident edges
  strength <- strength(g, v=V(g))
  metrics <- data.frame(strength)
  metrics <- cbind(values_name = rownames(metrics), metrics)
  rownames(metrics) <- 1:nrow(metrics)
  metrics$closeness <- closeness(g, v=V(g))
  metrics$betweenness <- betweenness(g, v=V(g))
  return(metrics)
}

And, finally, the plotting function:

plot_graph <- function(g){
  # drop isolated vertices before plotting
  isol <- which(degree(g)==0)
  g_wi <- delete.vertices(g, isol)
  ggraph(g_wi, layout = 'graphopt')+
    geom_node_point(size=0.4)+
    geom_edge_link(alpha=0.1)+
    theme_void()+
    geom_node_text(aes(label = name), size=2, color="blue", repel=T)+
    theme(plot.background = element_rect(fill = "#fffff8"))+
    coord_fixed()
}

2.2. Graph of the collectors

Read the distance matrix built for each name in the database:

distance <- read.csv('dist_matrix_Nom.csv', row.names = 'Nom')
g_names <- build_graph(distance)

As the graph is very dense, let’s look at the first 30 collectors only:

plot_graph(build_graph(distance[1:30, 1:30]))

Overall, it’s a fully interconnected graph, and it’s nearly impossible to deduce anything just by looking at it.

resl <- calculate_g_stats(g_names)

Let’s calculate the strength of each vertex to deduce which of the collectors influences the others the most, that is, the collector with the highest combined edge weight in the graph.
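As a toy illustration of what strength measures (a hypothetical three-vertex graph, not data from the study), it is simply the weighted degree: the sum of the weights of the edges incident to each vertex.

toy <- graph_from_adjacency_matrix(
  matrix(c(0, 2, 5,
           2, 0, 1,
           5, 1, 0), nrow = 3),
  mode = "undirected", weighted = TRUE)
strength(toy)  # 7 3 6: vertex 1 has incident weights 2 + 5, and so on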

ggplot(data=resl[resl$strength > 4500,], aes(y=reorder(values_name, strength), x=strength))+
  geom_bar(stat="identity")+
  theme_minimal()+
  theme(axis.title.y=element_blank())+
  labs(x='Strength')+
  theme(plot.background = element_rect(fill = "#fffff8"))

Let’s look at the betweenness centrality for the collectors:

ggplot(data=resl[resl$betweenness > 50,], aes(y=reorder(values_name, betweenness), x=betweenness))+
  geom_bar(stat="identity")+
  theme_minimal()+
  theme(axis.title.y=element_blank())+
  labs(x='Betweenness')+
  theme(plot.background = element_rect(fill = "#fffff8"))

2.3. Graph of the cities

distance <- read.csv('dist_matrix_Ville.csv', row.names = 'Ville')
g <- graph.adjacency(data.matrix(distance), mode = "undirected", diag = FALSE, weighted = T) 

Additionally, we can build the map using the closeness centrality measure for each city, that is, the inverse of the total shortest-path distance needed to reach every other city from a given city. As the weights in our graph represent similarity in terms of collected items, this may highlight the most influential cities in the database.
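To make the measure concrete, here is a toy example on an unweighted path graph 1-2-3 (hypothetical, not from the study): the middle vertex reaches the others in the fewest steps and therefore scores highest. Note that when weights are present, igraph treats them as distances.

toy <- make_ring(3, circular = FALSE)  # path graph: 1-2, 2-3
closeness(toy)  # 1/(1+2), 1/(1+1), 1/(2+1) = 0.33, 0.50, 0.33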

resl <- closeness(g, v=V(g))
resl <- data.frame(resl)
resl <- cbind(city = rownames(resl), resl)
rownames(resl) <- 1:nrow(resl)

# weighted vertex degree (strength) for each city
resl$centr <- strength(g, v=V(g))

merged <- merge(x = resl, y = coords, by = "city")

merged <- merged[!is.na(merged$resl),]
merged <- merged[!duplicated(merged$city),]

This can be visualized with a Voronoi diagram, where the plane is partitioned around the cities and each polygon is colored by the city’s closeness centrality. Additionally, each dot is sized by the weighted vertex degree (strength), highlighting the cities that are the most influential overall:

ggplot(merged)+
  geom_voronoi(aes(Lat, Long, fill=resl), show.legend = FALSE)+
  scale_fill_continuous(high = "#F0FBEF", low = "#1F6218")+
  geom_point(data=merged[merged$resl <  0.0006004510,], aes(Lat, Long, size=centr), show.legend = FALSE, color='#FE5F00', alpha=0.4)+
  theme_void()+
  labs(fill='Closeness')+
  geom_text_repel(data=merged[merged$resl < 0.0006004510,],
            aes(Lat, Long, label=city), color='#000100')+
  coord_fixed(ratio = 1.15)+
  theme(plot.background = element_rect(fill = "#fffff8"))