Humanités Numériques: Validation

Alexey Koshevoy

11/19/2020

1. Plots

To categorize each collector, I transformed the data about their collection into a binary vector of dummy variables. That is, each collector is represented as a binary vector of length 1007, each element encoding the presence or absence of a particular type of object (Céramique, Curiosités, or any other category from the database). The same operation can be done for each city as well. The notebook with all of the pre-processing can be found here.
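For reference, the encoding could be reproduced in R along these lines (a minimal sketch, assuming data_m has one row per collector-object pair; the actual pre-processing lives in the notebook linked above):

library(dplyr)
library(tidyr)

# Sketch: one row per collector, one 0/1 column per object type.
dummy_mat <- data_m %>%
  distinct(Nom, Objets.collectionnés) %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from = Objets.collectionnés,
              values_from = present,
              values_fill = 0L)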

1.1. Distributions, long tails, etc.

Let’s look at the city distribution first:

data_m %>% 
  group_by(Ville) %>%
  summarize(Count=n()) %>%
  arrange(Count) %>%
  filter(Count > 20) %>%
  ggplot(aes(y=reorder(Ville, Count), x=Count))+
  geom_bar(stat='identity', aes(fill=Ville))+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  scale_color_viridis()+
  theme(axis.title.y=element_blank(), 
        legend.position = "none")+
  labs(x='Number of object types')

As can be seen from the plot above, small cities tend to have fewer than 20 types of objects, so let’s stick with the first 13.

Now let’s look at all the cities in the database. However, instead of the city’s name, I will plot each city’s rank relative to the number of item types coming from it:

data_m %>% 
  group_by(Ville) %>%
  summarize(Count=n()) %>%
  arrange(desc(Count)) %>%
  mutate(Rank = row_number()) %>%
  ggplot()+
  geom_point(aes(x=Rank, y=Count))+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  scale_color_viridis()+
  theme(axis.title.y=element_blank(), 
        legend.position = "none")+
  annotate(
    geom = "curve", x = 99, y = 452, xend = 4, yend = 532, 
    curvature = .3, arrow = arrow(length = unit(2, "mm"))
  ) +
  annotate(geom = "text", x = 101, y = 455, label = "Paris", hjust = "left")

It seems like the number of items is distributed according to some power law. Let’s try Zipf’s law:
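Under Zipf’s law with exponent alpha, the count expected at rank r is Count(1) / r^alpha, which is what the zipfs_freq column below computes; the green line in the plot shows this prediction for alpha = 1.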

alpha = 1

data_m %>% 
  group_by(Ville) %>%
  summarize(Count=n()) %>%
  arrange(desc(Count)) %>%
  mutate(Rank = row_number(), 
         zipfs_freq = ifelse(Rank == 1, Count, dplyr::first(Count) / Rank^alpha)) %>%
  ggplot()+
  geom_point(aes(x=log(Rank), y=log(Count)), alpha=0.7)+
  geom_line(aes(x=log(Rank), y=log(zipfs_freq)), color='#26A96C')+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  labs(x='Log Rank', y='Log N of Items')

Indeed, the number of items in each city follows Zipf’s law quite closely. For this particular study, it means that there is a long tail of cities with similarly small numbers of items, which could potentially be ignored.

Needless to say, the exact same thing happens with the object types:

data_m %>% 
  group_by(Objets.collectionnés) %>%
  summarize(Count=n()) %>%
  arrange(desc(Count)) %>%
  mutate(Rank = row_number()) %>%
  ggplot()+
  geom_point(aes(x=Rank, y=Count))+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  labs(x='Type Rank')

Once again, there are some 10-15 very frequent types and then a long tail of infrequent ones. Does it follow Zipf’s law?

alpha = 1

data_m %>% 
  group_by(Objets.collectionnés) %>%
  summarize(Count=n()) %>%
  arrange(desc(Count)) %>%
  mutate(Rank = row_number(), 
         zipfs_freq = ifelse(Rank == 1, Count, dplyr::first(Count) / Rank^alpha)) %>%
  ggplot()+
  geom_point(aes(x=log(Rank), y=log(Count)), alpha=0.7)+
  geom_line(aes(x=log(Rank), y=log(zipfs_freq)), color='#26A96C')+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  labs(x='Log Rank of Type', y='Log N of Items')

Not quite, but the notion of a long tail still holds.

Therefore, let’s limit ourselves to the 13 cities with the greatest variability in object types and to the 14 most frequent types:

| X | Département | Ville | Nom | Prénom | Profession | N..rue | Rue | Objets.collectionnés | Notes..état.de.la.collection. | Type | City_subset |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ain | Bourg | du Péloux | M. | | | | Céramique | | Céramique | Other |
| 1 | Ain | Bourg | du Péloux | M. | | | | Livres | | Livres | Other |
| 2 | Ain | Montréal | Comte de Douglas | | | | Château de Montréal | Ex-libris | 4000 environ | Other | Other |
| 3 | Ain | Montréal | Comte de Douglas | | | | Château de Montréal | Blasons | | Other | Other |
| 4 | Ain | Montréal | Comte de Douglas | | | | Château de Montréal | Livres de blasons | | Other | Other |
| 5 | Ain | Montréal | Comte de Douglas | | | | Château de Montréal | Listels ou blasons employés par les Pompes funèbres | 2000 environs | Other | Other |

As you can see from the table above, I modified the dataframe in order to:

1. collapse all but the 14 most frequent object types into an ‘Other’ level of a new Type column;
2. collapse all but the 13 biggest cities into an ‘Other’ level of a new City_subset column (a sketch of how this could be done is shown below).
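This collapsing step could be sketched in R as follows (hypothetical code, assuming the data_m columns shown above; the actual modification was done during pre-processing):

# Collapse infrequent types and small cities into an 'Other' level.
top_types  <- data_m %>% count(Objets.collectionnés, sort = TRUE) %>%
  slice(1:14) %>% pull(Objets.collectionnés)
top_cities <- data_m %>% count(Ville, sort = TRUE) %>%
  slice(1:13) %>% pull(Ville)

data_m <- data_m %>%
  mutate(Type = ifelse(Objets.collectionnés %in% top_types,
                       as.character(Objets.collectionnés), 'Other'),
         City_subset = ifelse(Ville %in% top_cities,
                              as.character(Ville), 'Other'))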

1.2. Different types of objects across the biggest cities

Let’s look at the cities with the most items in the dataset:

data_m[data_m$Type != 'Other',] %>%
  group_by(City_subset, Type) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  ggplot(aes(x=reorder(Type, freq), y=freq))+
  geom_bar(aes(color=Type, fill=Type), stat='identity', position="dodge")+
  scale_y_continuous(breaks=c(0.01, 0.2, 0.4), labels=c('1%', '20%', '40%'))+
  coord_flip()+
  facet_wrap(~City_subset)+
  theme_minimal()+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  theme(axis.title.y=element_blank(), 
        legend.position = "none")+
  labs(y='Percentage')

In all of the cities above, the most popular items are usually paintings (Tableaux) and ceramics (Céramique). The same overall pattern holds across the ‘long tail’ of cities (see the ‘Other’ panel).

Let’s map the cities by their numbers of ceramics and paintings, as these two types are the most frequent ones:

coordss <- merge(x = data_m, y = coords, by.x = "Ville", by.y='city')

coordss[coordss$Objets.collectionnés %in% c('Céramique', 'Tableaux'),] %>%
  group_by(Ville, Objets.collectionnés) %>%
  summarize(Count=n(), Lat=max(Lat), Long=max(Long)) %>%
  arrange(desc(Count)) %>%
  filter(Count > 3) %>%
  ggplot(aes(x=Lat, y=Long))+
  theme_void()+
  coord_map()+
  geom_polygon(data=fr, aes(x = long, y = lat, group=group), fill='#F1DEDC')+
  geom_point(aes(size=Count), color='red')+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  geom_text_repel(aes(label=Ville))

And which cities have the most paintings?

coordss <- merge(x = data_m, y = coords, by.x = "Ville", by.y='city')

coordss[coordss$Objets.collectionnés == 'Tableaux',] %>%
  group_by(Ville) %>%
  summarize(Count=n(), Lat=max(Lat), Long=max(Long)) %>%
  arrange(desc(Count)) %>%
  filter(Count > 3) %>%
  ggplot(aes(x=Lat, y=Long))+
  coord_map()+
  theme_void()+
  geom_polygon(data=fr, aes(x = long, y = lat, group=group), fill='#F1DEDC')+
  geom_point(aes(size=Count), color='red')+
  theme(plot.background = element_rect(fill = "#fffff8"))+
  geom_text_repel(aes(label=Ville))

2. Graphs and polygons

The dummy vectors that we built for each city can be used to compute pairwise distance matrices between cities or between collectors. See the notebook for the pre-processing procedure.
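For instance, a collector-level matrix could be derived from the 0/1 vectors with the binary (Jaccard) distance (a sketch under that assumption, reusing the hypothetical dummy_mat from Section 1; the distance actually used is defined in the notebook):

m <- as.matrix(dummy_mat[, -1])   # drop the Nom column
rownames(m) <- dummy_mat$Nom
# method = 'binary' gives the Jaccard distance between 0/1 type vectors
dist_m <- as.matrix(dist(m, method = "binary"))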

2.1. Useful functions

First of all, let’s define a function that builds a graph from a pairwise distance matrix:

build_graph <- function(distance){
  distance <- data.frame(distance)
  # replace infinite distances with NA, then with 0 (i.e. no edge)
  is.na(distance) <- sapply(distance, is.infinite)
  distance[is.na(distance)] <- 0
  # undirected weighted graph from the pairwise distance matrix
  g <- graph.adjacency(data.matrix(distance), mode = "undirected", diag = FALSE, weighted = T) 
  return(g)
}

Then I will define a function that takes the graph produced by the previous function and computes the desired graph-based metrics on it: vertex strength (i.e. weighted degree centrality), closeness, and betweenness.

calculate_g_stats <- function(g){
  # strength = weighted degree, i.e. the sum of the weights of incident edges
  strength <- strength(g, v=V(g))
  metrics <- data.frame(strength)
  metrics <- cbind(values_name = rownames(metrics), metrics)
  rownames(metrics) <- 1:nrow(metrics)
  metrics$closeness <- closeness(g, v=V(g))
  metrics$betweenness <- betweenness(g, v=V(g))
  return(metrics)
}

And, finally, the plotting function:

plot_graph <- function(g){
  # drop isolated vertices before plotting
  isol <- which(degree(g)==0)
  g_wi <- delete.vertices(g, isol)
  ggraph(g_wi, layout = 'graphopt')+
    geom_node_point(size=0.4)+
    geom_edge_link(alpha=0.1)+
    theme_void()+
    geom_node_text(aes(label = name), size=2, color="blue", repel=T)+
    theme(plot.background = element_rect(fill = "#fffff8"))+
    coord_fixed()
}

2.2. Graph of the collectors

Read the distance matrix built for each name in the database:

distance <- read.csv('dist_matrix_Nom.csv', row.names = 'Nom')
g_names <- build_graph(distance)

As the graph is very dense, let’s look at the first 30 collectors only:

plot_graph(build_graph(distance[1:30, 1:30]))

Overall, it’s a fully interconnected graph, and it’s nearly impossible to deduce anything just by looking at it.

resl <- calculate_g_stats(g_names)

Let’s calculate the strength of each vertex to deduce which of the collectors influences the others the most, that is, the collector with the highest combined edge weight in the graph.
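As a toy illustration of what strength measures (a hypothetical three-vertex graph, not data from the study), it is simply the weighted degree: the sum of the weights of the edges incident to each vertex.

toy <- graph_from_adjacency_matrix(
  matrix(c(0, 2, 5,
           2, 0, 1,
           5, 1, 0), nrow = 3),
  mode = "undirected", weighted = TRUE)
strength(toy)  # 7 3 6: vertex 1 has incident weights 2 + 5, and so on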

ggplot(data=resl[resl$strength > 4500,], aes(y=reorder(values_name, strength), x=strength))+
  geom_bar(stat="identity")+
  theme_minimal()+
  theme(axis.title.y=element_blank())+
  labs(x='Strength')+
  theme(plot.background = element_rect(fill = "#fffff8"))

Let’s look at the betweenness centrality for the collectors:

ggplot(data=resl[resl$betweenness > 50,], aes(y=reorder(values_name, betweenness), x=betweenness))+
  geom_bar(stat="identity")+
  theme_minimal()+
  theme(axis.title.y=element_blank())+
  labs(x='Betweenness')+
  theme(plot.background = element_rect(fill = "#fffff8"))

2.3. Graph of the cities

distance <- read.csv('dist_matrix_Ville.csv', row.names = 'Ville')
g <- graph.adjacency(data.matrix(distance), mode = "undirected", diag = FALSE, weighted = T) 

Additionally, we can build the map using the closeness centrality measure for each city, that is, the inverse of the total shortest-path distance needed to reach every other city from a given city. As the weights in our graph represent similarity in terms of collected items, this may highlight the most influential cities in the database.
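To make the measure concrete, here is a toy example on an unweighted path graph 1-2-3 (hypothetical, not from the study): the middle vertex reaches the others in the fewest steps and therefore scores highest. Note that when weights are present, igraph treats them as distances.

toy <- make_ring(3, circular = FALSE)  # path graph: 1-2, 2-3
closeness(toy)  # 1/(1+2), 1/(1+1), 1/(2+1) = 0.33, 0.50, 0.33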

resl <- closeness(g, v=V(g))
resl <- data.frame(resl)
resl <- cbind(city = rownames(resl), resl)
rownames(resl) <- 1:nrow(resl)

# weighted vertex degree (strength) for each city
resl$centr <- strength(g, v=V(g))

merged <- merge(x = resl, y = coords, by = "city")

merged <- merged[!is.na(merged$resl),]
merged <- merged[!duplicated(merged$city),]

This can be visualized with a Voronoi diagram, where the plane is partitioned around the cities and each polygon is colored by the city’s closeness centrality. Additionally, each dot is sized by the weighted vertex degree (strength), highlighting the cities that are the most influential overall:

ggplot(merged)+
  geom_voronoi(aes(Lat, Long, fill=resl), show.legend = FALSE)+
  scale_fill_continuous(high = "#F0FBEF", low = "#1F6218")+
  geom_point(data=merged[merged$resl <  0.0006004510,], aes(Lat, Long, size=centr), show.legend = FALSE, color='#FE5F00', alpha=0.4)+
  theme_void()+
  labs(fill='Closeness')+
  geom_text_repel(data=merged[merged$resl < 0.0006004510,],
            aes(Lat, Long, label=city), color='#000100')+
  coord_fixed(ratio = 1.15)+
  theme(plot.background = element_rect(fill = "#fffff8"))