User-generated content is a valuable resource for capturing all aspects of our environment and lives, and dedicated Volunteered Geographic Information (VGI) efforts such as OpenStreetMap (OSM) have revolutionized spatial data collection. While OSM data is widely used, comparatively little attention has been paid to the quality of its Point-of-Interest (POI) component. This work studies the accuracy, coverage, and trend worthiness of POI data. We assess accuracy and coverage using another VGI source that employs editorial control: OSM data is compared to Foursquare data through a combination of label similarity and positional proximity. Using the example of coffee shop POIs in Manhattan, we also assess the trend worthiness of OSM data. A series of spatio-temporal statistical models is tested to relate the change in the number of coffee shops to home prices in certain areas. Overall, this work shows that, although not perfect, OSM POI data and specifically its temporal aspect (changesets) can be used to drive urban science research and to study urban change.
Various datasets are used to provide a sound quality assessment of OSM POI data. We assess (i) the accuracy and coverage of OSM as a POI data source by comparing it to corresponding Foursquare data, and (ii) its trend worthiness by building statistical models that relate the change in coffee shop numbers to the change in home prices. The following sections discuss each data source as well as the pre-processing steps needed to make the data actionable.
For our work, we are interested in changeset data that reflects actual real-world change, i.e., the addition or deletion of a coffee shop node in close temporal proximity to the actual opening or closing of that coffee shop. Since we do not have ground truth data such as municipal records, we rely on quantitative information such as the monthly coffee shop counts plotted in Fig 2, which show the overall change in the OSM data.
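To make this concrete, the sketch below shows one way such monthly counts can be derived from changeset-style node events. The input format (a CSV with columns `timestamp` and `action` taking the values `create` or `delete`) is an assumption for illustration, not the paper's actual file layout.

```python
import pandas as pd

# Hypothetical export of coffee shop node events extracted from OSM changesets.
events = pd.read_csv("coffee_shop_node_events.csv", parse_dates=["timestamp"])

# Each creation adds one coffee shop node, each deletion removes one.
events["delta"] = events["action"].map({"create": 1, "delete": -1})

# Net change per month, accumulated into a running count of coffee shop nodes.
monthly_change = events.set_index("timestamp")["delta"].resample("M").sum()
monthly_count = monthly_change.cumsum()
print(monthly_count.tail())
```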
Unfortunately, Foursquare does not provide access to its entire database in a fashion similar to OSM. An API with limited service rates allows one to interact with the service and, in our case, to retrieve POI information. To account for these API limitations, we use two types of queries, as detailed in Tables A and B in S1 Appendix. Type I uses the collected OSM data to retrieve all Foursquare POIs that have a matching label within a 50 m radius. Type II uses a regular spatial grid (200 m spacing) to retrieve all coffee shop POIs around the centroid of each cell. This strategy ensures that fewer than 50 POIs are retrieved per request (a Foursquare limit), while also covering POIs that are not in the immediate vicinity of our OSM POIs. The mapping between OSM and Foursquare categories is shown in Table C in S1 Appendix. The resulting Foursquare POI dataset is obtained by fusing the results of these two query types. The locations of all OSM POIs for query Type I and the grid center points for Type II are shown in Figs 5 and 6.
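A minimal sketch of the Type II grid strategy is given below. It assumes projected (meter-based) coordinates and a hypothetical `search_venues(x, y, radius_m)` wrapper around the Foursquare API that returns venue records with an `id` field; the search radius of roughly spacing·√2/2, chosen so that the circles cover each grid cell, is an illustrative assumption and not a value stated in the paper.

```python
import numpy as np

def grid_centroids(xmin, ymin, xmax, ymax, spacing=200.0):
    """Centroids of a regular grid with the given cell spacing (meters)."""
    xs = np.arange(xmin + spacing / 2.0, xmax, spacing)
    ys = np.arange(ymin + spacing / 2.0, ymax, spacing)
    return [(x, y) for x in xs for y in ys]

def collect_type2(bbox, search_venues, radius_m=142.0):
    """Query every grid centroid and de-duplicate venues by their id."""
    venues = {}
    for x, y in grid_centroids(*bbox):
        for v in search_venues(x, y, radius_m):  # each request stays under the 50-POI limit
            venues[v["id"]] = v
    return list(venues.values())
```

The same de-duplication step can be used when fusing the Type I and Type II results into one Foursquare POI dataset.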
POIs are typically captured as point locations, and the recorded coordinates cannot be expected to match exactly across data sources. To examine whether two coordinates capture the same POI (without considering the label), one can use a buffer region to test whether one location is close to the other. The choice of a proper threshold is critical, as several coffee shops of the same name (chain) might be close by. Using a projected coordinate system, the Euclidean distance can be used. Fig 9 gives an overview of the overall processing pipeline.
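The sketch below illustrates this matching step under the stated assumptions: projected coordinates in meters, a 50 m buffer, and a string-similarity cutoff of 0.8. The cutoff and the use of `difflib` are illustrative choices, not the paper's exact matching procedure.

```python
from difflib import SequenceMatcher
from math import hypot

def label_similarity(a: str, b: str) -> float:
    """Simple normalized string similarity between two POI labels."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_poi(osm_poi, fsq_pois, max_dist=50.0, min_sim=0.8):
    """Return the closest Foursquare POI within the buffer whose label is similar enough."""
    candidates = []
    for f in fsq_pois:
        d = hypot(osm_poi["x"] - f["x"], osm_poi["y"] - f["y"])  # Euclidean distance in meters
        if d <= max_dist and label_similarity(osm_poi["name"], f["name"]) >= min_sim:
            candidates.append((d, f))
    return min(candidates, key=lambda c: c[0])[1] if candidates else None
```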
We first introduce the concept of scaling relationships to model urban phenomena based on population size. This allows us to relate the change in coffee shop numbers and home prices to population and eventually to model the direct relationship between them. We use a range of spatial and temporal analysis methods and adjustments to try to improve the overall model fit. Section 5 identifies the adjustments that work best and, consequently, the model with the best fit.
Power law relationship between coffee shops and home prices. Coffee shop numbers correlate with population over time and place and can be treated as an indicator of human activity. Thus, they should follow a power law function of population. The real estate market, on the other hand, is an economic phenomenon also related to human activity. With Eqs 3 and 4, we have two relations, one between coffee shops c and population and one between home prices p and population: log(c_t) = log(c_0) + β_c log(N_t) and log(p_t) = log(p_0) + β_p log(N_t). By combining them, we infer a function between coffee shops and home prices. Eq 5 also represents a power law scaling relationship and can be fitted using regression techniques. As a side benefit, since both datasets are user generated, this would also establish the usefulness of such data for the investigation of urban phenomena.
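One way to carry out this combination is to eliminate the population term N_t from the two relations; a sketch of the derivation consistent with Eqs 3 and 4 is shown below (whether this matches the paper's exact notation for Eq 5 is an assumption).

```latex
\log(c_t) = \log(c_0) + \beta_c \log(N_t), \qquad
\log(p_t) = \log(p_0) + \beta_p \log(N_t)
\;\Longrightarrow\;
\log(p_t) = \log(p_0) + \frac{\beta_p}{\beta_c}\bigl(\log(c_t) - \log(c_0)\bigr),
\quad\text{i.e.}\quad
p_t = p_0 \left(\frac{c_t}{c_0}\right)^{\beta_p/\beta_c}.
```

The exponent β_p/β_c is the scaling factor that the regression models below estimate from the data.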
Fig 10 shows that different neighborhoods (shown using different colors and symbols) exhibit different patterns over time. The home prices of a neighborhood do not always increase as its coffee shop density increases. Fig 11 shows seasonal patterns across all neighborhoods, and these seem to be more consistent. Different sub-figures represent different seasons in different years. In general, home prices increase with coffee shop density for each season. In the lower-left corner, both coffee shop density and home prices are low. In the upper-right corner, home prices and coffee shop density show diverging trends.
Temporal lag. The relationship between coffee shops and home prices could be such that an increase in coffee shops (i) leads to higher home prices, (ii) coincides with them, or (iii) follows them. As such, we can add a lag variable for coffee shop density to our model (Eq 8).
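An illustrative sketch of such a lag adjustment is shown below, using a single one-season lag of log coffee shop density in a log-log regression. The panel file, the variable names, and the single-lag choice are assumptions for illustration, not the paper's exact specification of Eq 8.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical neighborhood/season panel with home prices and coffee shop density.
df = pd.read_csv("neighborhood_season_panel.csv")
df = df.sort_values(["neighborhood", "season"])
df["log_price"] = np.log(df["home_price"])
df["log_density"] = np.log(df["coffee_shop_density"])

# One-season lag of coffee shop density, computed within each neighborhood.
df["log_density_lag1"] = df.groupby("neighborhood")["log_density"].shift(1)

lag_model = smf.ols(
    "log_price ~ log_density + log_density_lag1",
    data=df.dropna(subset=["log_density_lag1"]),
).fit()
print(lag_model.summary())
```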
In our case of incomplete knowledge and variables, space might act as a proxy for the missing information, and spatial trend analysis might still be valuable for our model. Since coffee shop density could follow a spatial trend, a collinearity issue might exist between location and coffee shop density. The updated model, using a polynomial in the x and y coordinates, is given in Eq 9.
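Continuing the hypothetical panel from the previous sketch, a second-order polynomial trend surface in the projected x/y coordinates could be added as follows; the polynomial degree is an illustrative assumption rather than the paper's exact form of Eq 9.

```python
import statsmodels.formula.api as smf

# df: the hypothetical neighborhood/season panel from the previous sketch,
# extended with projected coordinates x, y (meters) for each neighborhood centroid.
trend_model = smf.ols(
    "log_price ~ log_density + x + y + I(x**2) + I(y**2) + I(x*y)",
    data=df,
).fit()
print(trend_model.summary())
```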
Modeling approach. In our study, we build models starting from a basic model and moving up to more involved models that include additional regressors. We do so in order to observe whether subsequent adjustments add power to the model. The simplest model uses the mean values of coffee shop density and home prices across all seasons; additionally, one model is built for each season. We use this simple approach to demonstrate the applicability of a scaling relationship model.
Next, a comprehensive model using data for all neighborhoods and seasons is introduced. It includes coffee shop density as an independent variable. Based on this model, different independent variables and spatial autoregression methods are added or removed based on their respective p-values and modeling power.
To assess the performance of our various models, we choose the Akaike Information Criterion (AIC) [53] and the Bayesian Information Criterion (BIC) [54] instead of the typical R-squared metric, since AIC and BIC are more resilient to overfitting [55]. In our modeling approach, there are two sources of complexity: the first is the increasing number of independent variables; the second comes from the fact that polynomial models risk being overly complex for our modeling case. Both AIC and BIC are information-based criteria that assess model fit and serve as metrics for selecting among a finite set of models. They reward a high likelihood and penalize an increasing number of parameters and complexity. As such, both are resilient to overfitting and widely used for model comparison in modern statistics. In general, lower AIC or BIC values indicate a better model fit; BIC will normally give a higher score, as it penalizes model complexity more than AIC. One strategy to add or eliminate a variable is that if a variable (i) is not considered statistically significant (p-value), or (ii) a model using it has a similar or worse AIC or BIC value than a simpler model, then this variable is not included in the next (improved) model. Another strategy is that if the coefficient of coffee shop density changes significantly when new variables are added, the model is considered a bad one, since the new variables eliminate the contribution of coffee shop density to our model. This reasoning relates to the discussion of spatial trends and spatial auto-regression, where these models might implicitly (through location) estimate coffee shop density.
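The comparison step can be sketched as follows, reusing the hypothetical panel and lag variable from the earlier sketches; the reported AIC/BIC values in the text come from the paper's models, not from this illustration.

```python
import statsmodels.formula.api as smf

# Fit both models on the same rows so that their AIC/BIC values are comparable.
panel = df.dropna(subset=["log_density_lag1"])   # df from the lag-model sketch
m1 = smf.ols("log_price ~ log_density", data=panel).fit()
m3 = smf.ols("log_price ~ log_density + log_density_lag1", data=panel).fit()

for name, m in [("M1", m1), ("M3", m3)]:
    print(f"{name}: AIC={m.aic:.1f}  BIC={m.bic:.1f}")
# Lower AIC/BIC is better; a candidate variable is dropped if it is not
# statistically significant or if the richer model does not beat the simpler one.
```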
For our first model case, we try to fit one scaling relationship model per season. Additionally, we use one baseline model for the mean values of coffee shop densities and home prices across seasons. The two variables are shown in Fig 15; the blue line is the fitted regression line, and the grey area is the confidence interval of the predicted values. The coefficients and model fitting results in Table 1 suggest that the scaling factor β is stable at around 0.30 across seasons. All models have very small p-values (less than 0.01), which indicates a good fit. These results are a strong indication of the existence of a scaling relationship between coffee shop densities and home prices. Fig 16 shows the normal probability plot of the residuals, which follow a normal distribution with a high goodness-of-fit (R-squared = 0.9389).
As mentioned in Section 4.2.1, M1 is the basic model considering only coffee shop density as the independent variable. β is estimated to be 0.3032, which is almost the same as in the simple model using mean values. We use this model as the baseline when assessing the various adjustments.
Temporal lag modeling (M3) did not perform well: all of the lag parameters have insignificant p-values, and even though the AIC slightly dropped to 18.0 (from 19.8 in M1), the BIC increased to 33.3 (from 28.7 in M1). This shows that M3 does not improve over M1; changes in coffee shop density have no effect beyond a single season (three-month period).