Applied Data Science Capstone — Fado, the sound of saudades.

Introduction and business problem

Fado is the sound of Lisbon. This characteristic style of Portuguese music is part of the UNESCO Intangible Cultural Heritage List, and no trip to Lisbon is complete without a visit to a Fado show. Traditionally, the neighbourhood of Alfama is home to most of the Fado venues in Lisbon, and walking through its narrow streets one is often greeted by the haunting voices of the many Fadistas who perform there.

However, while tradition is very important to the Portuguese, the recent COVID pandemic necessitates a rethinking of the commercial potential of Fado. Tourism has long been one of Portugal’s main sources of income, and my client would like to open a cultural centre to capitalise on the tourism industry and to showcase some of the best talent on the Fado scene in a single location. Given Lisbon’s very hilly topography, my client is looking for a cultural and touristic hotspot where a visitor wouldn’t have to walk more than 250 metres in any direction to find a choice of several Fado venues and hotels. To find it, I will map the hotels and Fado venues in the centre of Lisbon using their geographical coordinates, and use a clustering algorithm to identify the cultural and tourist hotspots for my client’s venture.

Data Section

The data I will use to solve this problem are:

Geolocation and names of the Fado venues in Lisbon.

Geolocation, names and types of the accommodation available in Lisbon.

I will source my data using the following:

Foursquare API for the accommodation data.

Google Places API for the Fado venues data.

Methodology

First, I installed the libraries and resources I would use for the project, adding more as I progressed.

The first datapoints I required were the geolocation coordinates of the centre of Lisbon, which I obtained using Nominatim.
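A minimal sketch of this lookup, assuming the geopy package's Nominatim geocoder (the text names Nominatim but not the client library). The helper name city_centre, the place string and the user_agent value are illustrative assumptions. The geocoder is passed in as an argument so the function can be exercised without network access.

```python
def city_centre(geocoder, place="Lisbon, Portugal"):
    """Return (latitude, longitude) for `place` using a geopy-style geocoder.

    `geocoder` is any object with a .geocode(str) method that returns an
    object exposing .latitude and .longitude, or None when nothing is found.
    """
    location = geocoder.geocode(place)
    if location is None:
        raise ValueError(f"No geocoding result for {place!r}")
    return float(location.latitude), float(location.longitude)


# Typical use (needs the geopy package and network access):
#   from geopy.geocoders import Nominatim
#   lat, lng = city_centre(Nominatim(user_agent="fado-capstone"))
```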

Once I had these, I executed a hotel query with the Foursquare API to return hotels near the city centre. It returned a JSON response made up of a dictionary containing several nested dictionaries and lists. I decided the quickest way to create a dataframe would be to iterate through this data and extract the relevant values (Hotel_Name, Type_of_Accomodation, Latitude, Longitude) from the nested dictionaries and lists, create one list per column and finally merge them into a single dataframe. I wrangled this data to exclude hostels, boarding houses and an instance of a hotel pool which appeared during the search. I also converted the coordinates into the datatype “float64” so that they could be mapped. Once I was satisfied with the format of the data, I used Folium to plot the hotels onto a map of Lisbon.

The Foursquare API query:
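A sketch of how such a query could be assembled, assuming the legacy Foursquare v2 venues/search endpoint. The credentials are placeholders, and the radius, limit and version date are illustrative assumptions, not the values used in the project.

```python
from urllib.parse import urlencode

# Assumed endpoint: the legacy Foursquare v2 venue search.
FOURSQUARE_SEARCH_URL = "https://api.foursquare.com/v2/venues/search"


def build_hotel_query(client_id, client_secret, lat, lng,
                      query="hotel", radius=2000, limit=50,
                      version="20200601"):
    """Build the search URL for venues matching `query` near (lat, lng).

    radius (metres), limit and version are assumed defaults for illustration.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return f"{FOURSQUARE_SEARCH_URL}?{urlencode(params)}"


# Placeholder credentials and coordinates, for illustration only.
url = build_hotel_query("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", 38.71, -9.14)
```

The resulting URL can then be fetched with an HTTP client of choice, for example requests.get(url).json().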

An example of the iteration to retrieve the latitude and longitude values as lists:
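The iteration might look roughly like the sketch below. It assumes the response follows the legacy Foursquare v2 layout (response, then venues, then location.lat and location.lng); the sample payload and the function name extract_columns are made up for illustration.

```python
# Hypothetical payload shaped like a Foursquare v2 search response.
sample = {
    "response": {
        "venues": [
            {"name": "Hotel A", "categories": [{"name": "Hotel"}],
             "location": {"lat": 38.710, "lng": -9.140}},
            {"name": "Hostel B", "categories": [{"name": "Hostel"}],
             "location": {"lat": 38.712, "lng": -9.139}},
            {"name": "Uncategorised C", "categories": [],
             "location": {"lat": 38.709, "lng": -9.142}},
        ]
    }
}


def extract_columns(payload):
    """Walk the nested venue records and build one list per column."""
    names, kinds, lats, lngs = [], [], [], []
    for venue in payload["response"]["venues"]:
        categories = venue.get("categories") or []
        names.append(venue["name"])
        # Some venues carry no category; record None rather than failing.
        kinds.append(categories[0]["name"] if categories else None)
        lats.append(venue["location"]["lat"])
        lngs.append(venue["location"]["lng"])
    return names, kinds, lats, lngs


hotel_names, accommodation_types, latitudes, longitudes = extract_columns(sample)
```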

Joining the lists into a dataframe:
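A sketch of the join and the clean-up that follows, using pandas. The column names come from the text above; the sample rows and the exact category labels being excluded ("Hostel", "Boarding House", "Hotel Pool") are assumptions for illustration.

```python
import pandas as pd

# Made-up lists standing in for the values extracted from the API response.
hotel_names = ["Hotel A", "Hostel B", "Hotel C", "Pool D"]
accommodation_types = ["Hotel", "Hostel", "Hotel", "Hotel Pool"]
latitudes = ["38.7100", "38.7120", "38.7090", "38.7110"]
longitudes = ["-9.1400", "-9.1390", "-9.1420", "-9.1410"]

# One list per column, joined into a single dataframe.
hotels = pd.DataFrame({
    "Hotel_Name": hotel_names,
    "Type_of_Accomodation": accommodation_types,
    "Latitude": latitudes,
    "Longitude": longitudes,
})

# Drop the accommodation types that are out of scope (assumed labels).
excluded = ["Hostel", "Boarding House", "Hotel Pool"]
hotels = hotels[~hotels["Type_of_Accomodation"].isin(excluded)]
hotels = hotels.reset_index(drop=True)

# Coordinates must be numeric before they can be mapped or clustered.
hotels = hotels.astype({"Latitude": "float64", "Longitude": "float64"})
```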

Part of the hotel dataframe:

Mapping the hotels:

The map of hotels:

I then executed a query on the Google Places API to find the Fado venues in Lisbon, and it returned a JSON file with nested lists of the data. As before, I iterated through the lists to retrieve the venue name, latitude and longitude, and combined these lists into a single dataframe as shown below:
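A sketch of what this query and parsing step might look like. It assumes the Google Places Text Search endpoint (the text does not say which Places endpoint was used); the search phrase, API key and sample payload are placeholders.

```python
from urllib.parse import urlencode

# Assumed endpoint: Google Places Text Search.
PLACES_TEXT_SEARCH_URL = (
    "https://maps.googleapis.com/maps/api/place/textsearch/json"
)


def build_fado_query(api_key, query="Fado in Lisbon"):
    """Build a Text Search URL; the query phrase is an assumed example."""
    return f"{PLACES_TEXT_SEARCH_URL}?{urlencode({'query': query, 'key': api_key})}"


def parse_places(payload):
    """Return (name, latitude, longitude) rows from a Text Search response."""
    rows = []
    for place in payload.get("results", []):
        location = place["geometry"]["location"]
        rows.append((place["name"], float(location["lat"]), float(location["lng"])))
    return rows


# Hypothetical payload shaped like a Text Search response.
sample = {
    "results": [
        {"name": "Fado Venue A",
         "geometry": {"location": {"lat": 38.7115, "lng": -9.1300}}},
        {"name": "Fado Venue B",
         "geometry": {"location": {"lat": 38.7108, "lng": -9.1452}}},
    ]
}
fado_rows = parse_places(sample)
```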

The Fado Dataframe:

Once again, I converted the latitude and longitude data points into the datatype “float64” and mapped them onto the map of Lisbon:

I then produced a map with the hotels and Fado venues shown alongside each other:

Because I am looking for clusters based on distance, I decided to use the DBSCAN algorithm as this allows one to input latitude and longitude data and use the haversine distance between points to form the clusters.
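For reference, the haversine distance between two coordinate pairs can be computed as in the sketch below. It uses a mean Earth radius of 6371 km, the same figure used later for the eps parameter; the function name is illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude is roughly 111 km on this sphere.
one_degree = haversine_km(38.0, -9.0, 39.0, -9.0)
```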

Firstly, I had to combine the latitudes and longitudes from my dataframes into a 2D array, as this is the format DBSCAN requires for this type of data:
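One way to build that array, sketched with pandas and NumPy. The two small dataframes are made-up stand-ins for the hotel and Fado dataframes.

```python
import pandas as pd

# Made-up stand-ins for the hotel and Fado venue dataframes.
hotels = pd.DataFrame({"Latitude": [38.7100, 38.7090],
                       "Longitude": [-9.1400, -9.1420]})
fado = pd.DataFrame({"Latitude": [38.7115, 38.7108, 38.7120],
                     "Longitude": [-9.1300, -9.1452, -9.1310]})

# Stack both sets of coordinates into one (n_points, 2) array,
# with latitude in column 0 and longitude in column 1.
columns = ["Latitude", "Longitude"]
coords = pd.concat([hotels[columns], fado[columns]],
                   ignore_index=True).to_numpy(dtype="float64")
```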

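Then I used DBSCAN on these datapoints to find which venues could form part of a cluster from where you could reach 14 other venues (hence min_samples = 15, since the count includes the point itself).

Because I was using latitude and longitude coordinates with real-world distances in metres, I converted the coordinates to radians with NumPy when I fit the model, and set the eps parameter to 0.25/6371. This is the 250 m radius expressed in kilometres (0.25) divided by the Earth’s mean radius of about 6371 km, which gives the distance in radians. I also used the ball-tree algorithm, which in scikit-learn supports the haversine metric on two-dimensional latitude and longitude data.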
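A sketch of the clustering call with scikit-learn, using the parameters described above. The coordinates are synthetic: a tight grid of 20 points plus three far-away points, chosen only so the example has one obvious cluster and some outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0
eps_radians = 0.25 / EARTH_RADIUS_KM  # 250 m as an angle, about 3.92e-5 rad

# Synthetic data: 20 points within a few tens of metres of each other...
dense = np.array([[38.7100 + 0.0001 * i, -9.1400 + 0.0001 * j]
                  for i in range(5) for j in range(4)])
# ...and three points roughly a degree away from everything else.
isolated = np.array([[39.71, -9.14], [38.71, -8.14], [37.71, -9.14]])
coords = np.vstack([dense, isolated])

db = DBSCAN(eps=eps_radians, min_samples=15,
            algorithm="ball_tree", metric="haversine")
# The haversine metric expects [latitude, longitude] in radians.
labels = db.fit_predict(np.radians(coords))

outlier_mask = labels == -1           # DBSCAN marks noise points with -1
outliers = coords[outlier_mask]
```

scikit-learn assigns the label -1 to points that belong to no cluster, which is what the outlier handling below relies on.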

I then plotted my results; the outliers are represented as ‘x’ below:

I then created a dataframe with the coordinates of my outliers so that I could exclude them from the final dataframe, from which I would plot the heatmap showing the cultural and tourism hotspot.

Using an outer join to exclude the outlier latitude and longitudes from a dataframe of all the latitudes and longitudes examined, and making a list for the heatmap:
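A sketch of that join using the indicator option of the pandas merge function. The coordinates are made up; in the project they would come from the DBSCAN results.

```python
import pandas as pd

# Made-up coordinates: every point examined, and the subset flagged as outliers.
all_points = pd.DataFrame({"Latitude": [38.710, 38.711, 38.720, 38.750],
                           "Longitude": [-9.140, -9.141, -9.150, -9.200]})
outliers = pd.DataFrame({"Latitude": [38.750], "Longitude": [-9.200]})

# The outer join keeps every row; the indicator column records where each
# row came from ("left_only", "right_only" or "both").
merged = all_points.merge(outliers, on=["Latitude", "Longitude"],
                          how="outer", indicator=True)

# Rows found only in all_points are the non-outliers.
clustered = merged.loc[merged["_merge"] == "left_only",
                       ["Latitude", "Longitude"]].reset_index(drop=True)

# Folium's heatmap layer takes a list of [latitude, longitude] pairs.
heat_data = clustered.values.tolist()
```

When the DBSCAN labels are still at hand, filtering the coordinate array with labels != -1 gives the same set of points without a join.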

Lastly, I plotted all the hotels and Fado venues onto a Folium heatmap showing the venues that fall into the cultural and tourism hotspot in Lisbon:

Results

Based on the heatmap produced and shown below, several interesting observations can be made:

Firstly, being familiar with Lisbon, I expected to see a large concentration of Fado venues in the Alfama neighbourhood, and this was indeed the case. However, there are few hotels located there, as it is mostly a residential neighbourhood. Furthermore, it was unsurprising that many of the hotels are located just north of the centre of Lisbon (known as the Praça do Comércio).

What is clear from the analysis performed is that the cultural and touristic hotspot for Fado lies neither in the most famous Fado district of Alfama, nor in the centre of the city itself where a majority of the hotels are located. Instead, it lies at the junction between the neighbourhoods of Santa Catarina, Encarnação and Sacramento, where a person can find numerous Fado venues and hotels within a short walk from the centre of the hotspot (shown in red/orange).

Discussion

Based on the geolocation data for hotels and Fado venues obtained from the Foursquare and Google APIs respectively, and having applied the DBSCAN unsupervised clustering algorithm to group hotels and Fado venues based on a distance of 250m, I recommend that my client consider the cultural and touristic hotspot located at the juncture of the Santa Catarina, Encarnação and Sacramento neighbourhoods for their venture.

It is worth noting that while the Alfama district is home to most of the Fado venues, and most of the hotels are found just north of the Praça do Comércio, these sites don’t really offer the “best of both worlds” for someone looking to avoid walking over the many hills of Lisbon in order to benefit from both tourism and the appeal of Fado venues.

One caveat is in order, however. Given that this study relied on data derived from the free versions of Foursquare and Google APIs, and only returned one hotspot, future researchers could benefit from applying a similar method to larger datasets, and could always adjust the parameters of the DBSCAN algorithm for more/fewer venues within a greater/shorter distance based on their particular use case.

Conclusion

In conclusion, this study has demonstrated the use of a clustering algorithm to identify a hotspot of activity between two types of venues within a city. In this case, I identified that the nexus of the Santa Catarina, Encarnação and Sacramento neighbourhoods in Lisbon represents a viable location for someone wishing to open a cultural centre to bring the beauty of Fado to people visiting the city on seven hills, all within a comfortable walking distance of numerous hotels and Fado venues.