Open Data in Software Development
How software improves when high-quality open data is available for free.
No time to read the whole story? 😁 Here are the main take-aways — aka my personal hypotheses — from this blogpost:
Open Data in Software Development
- Open Data becomes more important at all stages of the software development process — also due to agile processes
- Easy and fast access to Open Data is critical for software development
- Open Data is to be preferred over manually created test data, which is time-consuming to produce and does not reflect reality
Software Development with Open Data
- Open Data improves the quality of software, especially for the end-users
- Open Data leads to more inclusion and diversity in the development process as it avoids overfitting
- Open Data enables innovation and more efficient development cycles
Feel free to challenge these hypotheses in the comments below 👇
About a year ago, I was invited to an Open Data Beer in Zurich where I spoke about how we use Open Data at the Esri R&D Center Zurich. I finally found some time to work on this write-up and am looking forward to hearing your thoughts on it and on how you are using Open Data in your daily work 😊
Open Data
The sheer amount of data that is being captured, stored, and made available today is beyond anyone’s imagination. The data volume in 2020 is measured in multiple zettabytes, where one zettabyte is 10^21 bytes: a 1 followed by 21(!) zeros. Of course, most of this data is owned by private companies and used to track users, feed recommendation systems, calculate business performance indicators, etc.
Only a small subset is made available as Open Data, mostly data captured and maintained by governments, universities and research institutes, as well as non-profit organizations. Even some private companies work on making Open Data accessible to the public, as the example of Google’s research database shows, which can query over 25 million freely available datasets. There are several global initiatives to drive the accessibility of government data (technically it belongs to the people anyway as they’ve paid for it with their taxes 😎). But Open Data is not just data that is free of charge: it also needs to be machine-readable, well documented, and up-to-date. If you are interested in learning more, you can read about the requirements for data to become Open Data here.
To achieve the goals of sustainable development, critical data must be open and available for reuse by anyone, anywhere, anytime. — Tim Berners-Lee
Frankly speaking, Zurich is an Open Data paradise 😍. Already as a student, I made use of the amazing data made available by both the City of Zurich and the Canton of Zurich. For example, in one of our semester projects we built a fun location-based game called Pioneer — clearly inspired by the board game “Settlers of Catan” — where we used Open Data to display sources of water, wood, and metal connected to real-world objects in the city such as fountains or trees. It was extremely easy to find, download, and use the data for our mobile app.
In my experience now as a Software Developer, the lack of available high-quality data unfortunately tends to stop exciting and innovative projects from going further, especially when exploring the machine learning universe of endless possibilities. We also have a strong focus on geospatial data, which is even harder to find as it typically requires more maintenance.
BUT I really don’t want to complain 😁 because we still get to work with incredible data that helps us all along the software development life cycle.
Software Development
When looking at classical software architecture, data is the foundation and, in my opinion, the most crucial part for software to work well for its intended use. High quality and continuous accessibility of the data are key for a successful and reliable product. Think of Covid-19 related databases, e.g. the official one from WHO. All the dashboards, maps & tracking apps built on top of this data would be meaningless (if not dangerous) if they didn’t display data that is of high quality, well-maintained, and up-to-date.
But data is not only the foundation of most software products, it is also used throughout the software development life cycle to increase the quality of software. In the following, I will show you examples from three stages of the software development life cycle where we frequently use Open Data:
- Open Data for Analysis
- Open Data for Implementation
- Open Data for Testing & Integration
I believe that the way software is predominantly developed today, in an agile way, has made access to Open Data more important than it used to be. Less time available for each stage of the development cycle and a lower level of specialization of the people involved require easy, fast, and open access to relevant datasets.
Open Data for Analysis
In the analysis stage, a few things are given: resources are low, time is scarce, and design iterations are short and frequent. In this phase, the functionality of the software is sketched out, for example in the form of a prototype. The code produced for the prototype is often disposed of afterward. The main goal is to find all the functionality we want in the final product & most importantly all the roadblocks and functionality we don’t want.
The outcomes of this phase are usually POCs (proofs of concept). Many design iterations lead to a visual and vivid prototype where everybody involved can imagine the final feature’s look & feel. At the same time, the feasibility of the implementation has been explored in order to make assumptions about workload, technical challenges, dependencies, etc.
⭐ Example: Interactive data exploration is a great way to make users better understand their data, for example in a city environment. In a 3D space, exploration can be quite complex and the main challenge is to keep the visual clutter low while presenting all the necessary information to the user. The below image shows a prototype that we have built to explore this use case further. It uses the openly available, amazingly detailed 3D buildings of the city of Zurich to map space use onto it and display it throughout the city.
If test data is needed in this phase, it is often mocked — mainly due to time constraints. It is either created manually from scratch (often overfitting the use case) or generated automatically with some assumptions in mind. Sometimes this results in rather unrealistic data — as you can imagine 😆 — with smaller or larger implications.
✔️ Why is it important to use Open Data for this stage in the software development process? So that early (design) decisions can be made with realistic information at hand. Realistic test data can show you immediately what you will have to deal with later on for implementation. Additionally, realistic data will prevent you from overfitting your data to the use case.
✔️ And what type of Open Data is needed for this stage? Datasets that are easy to understand and parse, and easy to embed and use in prototypes. If it takes you two days to understand the schema of the dataset & another three days to download and convert the data to the right format, then you probably won’t end up using it for your prototype.
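To make the "easy to understand and parse" criterion a bit more tangible, here is a minimal sketch of a quick schema check you could run on a dataset before committing to it for a prototype. It assumes the data comes as GeoJSON; the inline sample, its field names, and the function name `summarize_schema` are made up for illustration and do not refer to any real Zurich dataset.

```python
import json
from collections import Counter, defaultdict

# Hypothetical sample standing in for a downloaded open dataset (GeoJSON).
SAMPLE = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature",
   "geometry": {"type": "Point", "coordinates": [8.54, 47.37]},
   "properties": {"name": "Fountain A", "year_built": 1901}},
  {"type": "Feature",
   "geometry": {"type": "Point", "coordinates": [8.53, 47.38]},
   "properties": {"name": "Fountain B", "year_built": null}},
  {"type": "Feature",
   "geometry": null,
   "properties": {"name": null, "year_built": "unknown"}}
]}
"""


def summarize_schema(geojson_text):
    """Count value types per property and geometry types per feature."""
    collection = json.loads(geojson_text)
    property_types = defaultdict(Counter)
    geometry_types = Counter()
    for feature in collection["features"]:
        geometry = feature.get("geometry")
        geometry_types[geometry["type"] if geometry else "missing"] += 1
        for key, value in (feature.get("properties") or {}).items():
            # JSON null becomes None, so it shows up as "NoneType".
            property_types[key][type(value).__name__] += 1
    return dict(property_types), geometry_types


if __name__ == "__main__":
    properties, geometries = summarize_schema(SAMPLE)
    print("geometry types:", dict(geometries))
    for name, types in properties.items():
        print(f"{name}: {dict(types)}")
```

Even on three features, a summary like this surfaces mixed value types and missing geometries, which is exactly the kind of surprise that hand-made mock data tends to hide.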
Open Data for Implementation
When implementing new features and functionality, these need to be continuously tested out in the *wild* — aka with unseen data. I have been working with data for a while now and I strongly believe that nothing is impossible, especially not with data. As developers, we often assume nearly perfect data. But there’s no such thing as perfect data — unless you don’t allow your users to create and manipulate their own data (lucky you! 😁) and even then you aren’t safe from invalid data entries.
When implementing new features, we often think about and account for edge cases: rare occurrences of what users would want to do, or what data could look like (or both at the same time, jackpot⭐). Even though edge cases tend to be rare, they still happen & it is our responsibility as developers to catch them, handle them elegantly, and most of all: make sure they don’t crash the app.
⭐ Example: To formalize the relationships between different datasets, we made a preliminary assumption that a parcel (lot) geometry as shown below would map to one underlying zoning geometry — in 99% of the cases. When we started working with real zoning and parcel data, as shown here by the city of Zurich, it quickly turned out that parcels with multiple intersecting zoning boundaries are actually not that rare.
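An assumption like "one parcel maps to one zone" is cheap to check once real data is at hand. The following is only a sketch under strong simplifications: axis-aligned rectangles stand in for real parcel and zoning polygons, and all names and coordinates are hypothetical toy values. With real geometries you would swap the `overlaps` test for a proper polygon intersection from a geometry library.

```python
from collections import Counter
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle, a simplified stand-in for a polygon."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def overlaps(a, b):
    """True if the interiors overlap; merely touching edges does not count."""
    return (a.min_x < b.max_x and b.min_x < a.max_x
            and a.min_y < b.max_y and b.min_y < a.max_y)


def zones_per_parcel(parcels, zones):
    """Map each parcel id to the list of zone ids it overlaps."""
    return {
        parcel_id: [zone_id for zone_id, zone in zones.items()
                    if overlaps(parcel, zone)]
        for parcel_id, parcel in parcels.items()
    }


def assumption_report(parcels, zones):
    """Count parcels by how many zones they overlap (0, 1, 2, ...)."""
    matches = zones_per_parcel(parcels, zones)
    return Counter(len(zone_ids) for zone_ids in matches.values())


# Hypothetical toy data: two adjacent zones and four parcels.
ZONES = {
    "residential": Rect(0, 0, 10, 10),
    "commercial": Rect(10, 0, 20, 10),
}
PARCELS = {
    "p1": Rect(1, 1, 4, 4),       # fully inside one zone
    "p2": Rect(8, 2, 12, 6),      # straddles the zoning boundary
    "p3": Rect(12, 1, 15, 3),     # fully inside the other zone
    "p4": Rect(25, 25, 30, 30),   # not covered by any zone
}

if __name__ == "__main__":
    for zone_count, parcel_count in sorted(assumption_report(PARCELS, ZONES).items()):
        print(f"{parcel_count} parcel(s) overlap {zone_count} zone(s)")
```

Running such a report against a real dataset gives an actual share of parcels with zero, one, or several zones, instead of a guessed 99%.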
Coming up with all possible edge cases is a challenging exercise as we often cannot even imagine what data can look like until we have seen it. Edge cases (which are not always so rare) often only show up when you start working with real data. If you start testing your application with real-world data only after implementation (or after releasing it to your users*), you will likely miss certain edge cases.
When it comes to data, we also need to make an effort to protect the user from creating (or importing) invalid data. Catching edge cases early & being able to handle them thoughtfully (and not in a rush for a hotfix after release) will spare you a lot of trouble.
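As a small illustration of guarding against invalid data on import, here is a sketch of a check for a single polygon ring. It assumes GeoJSON-style input, where a linear ring needs at least four positions and must repeat its first position at the end. The function name `validate_ring` and the wording of the messages are hypothetical, and a real importer would need many more checks (self-intersections, coordinate ranges, and so on).

```python
import math


def is_valid_position(position):
    """A position is a list or tuple of at least two finite numbers."""
    if not isinstance(position, (list, tuple)) or len(position) < 2:
        return False
    return all(
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
        for value in position
    )


def validate_ring(ring):
    """Return a list of human-readable problems; empty means no problem found."""
    problems = [
        f"position {index} is not a pair of finite numbers"
        for index, position in enumerate(ring)
        if not is_valid_position(position)
    ]
    if len(ring) < 4:
        problems.append("ring has fewer than 4 positions")
    elif not problems and list(ring[0]) != list(ring[-1]):
        # Only compare endpoints once every position is known to be well-formed.
        problems.append("ring is not closed")
    return problems


if __name__ == "__main__":
    imported_ring = [[0, 0], [1, 0], [1, 1], [0, 1]]  # last position missing
    for problem in validate_ring(imported_ring):
        print("rejected:", problem)
```

Returning a list of problems rather than raising on the first one makes it easier to show the user everything that is wrong with their data in one go.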
✔️ Why is it important to use Open Data for this stage in the software development process? In order to account for many different occurrences of data entries and to catch edge cases during implementation — and not after.
✔️ And what type of Open Data is needed for this stage? Rich and diverse datasets that can provide many use cases, e.g. time series or datasets from different locations.
*side note: there is of course validity in conducting beta testing or A/B testing with users before officially releasing a (new) version of your product. We have had good experiences collaborating with early adopters before releasing the first version of our product ArcGIS Urban in order to account for a variety of user requirements and test it out with different datasets.
Open Data for Testing & Integration
The last example that I would like to mention is probably the most obvious one. We use openly available data for testing our software. One aspect of testing is scalability and performance, which is especially crucial in a browser environment. Performance testing in our case needs to be done with large volumes of complex geometrical data. The goal is to find the limits of the software in order to continuously push them, but also to prevent usage beyond the current limits (which would crash the browser). For performance testing, you could of course also use automatically created large volumes of data, and sometimes this makes sense.
⭐ Example: 3D web applications running in a browser environment are often limited by memory constraints. 3D data can be highly detailed, especially when showing complex geometries as well as realistic textures, and can therefore fill up browser memory quickly. For our web applications that display multiple three-dimensional datasets at the same time and over relatively large areas, we needed to know the limits of what the browser can handle for display. Finding those limits allowed us to first restrict certain functionality, e.g. to a maximum feature count, which still enabled most of the use cases, but not all of them. Then, we continuously worked to expand these limits, e.g. by using tile-based data streaming, by using different levels of detail when displaying data, by introducing new data formats (e.g. I3S), etc. Over time, the limits could be set higher and eventually even be removed completely.
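The idea of restricting functionality to a maximum feature count can be sketched as a simple guard that runs before anything is handed to the renderer. This is an illustration only, not how any particular product does it: the class and function names are hypothetical, and the limit of 50,000 is an arbitrary placeholder for a value you would determine through performance testing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisplayLimits:
    """Limits found through performance testing (value below is a placeholder)."""
    max_features: int


class TooManyFeaturesError(ValueError):
    """Raised when the requested layers exceed the display budget."""


def check_display_budget(layer_feature_counts, limits):
    """Return the total feature count, or raise if it exceeds the limit."""
    total = sum(layer_feature_counts.values())
    if total > limits.max_features:
        raise TooManyFeaturesError(
            f"{total} features requested, but at most "
            f"{limits.max_features} can be displayed"
        )
    return total


LIMITS = DisplayLimits(max_features=50_000)  # hypothetical limit

if __name__ == "__main__":
    requested = {"buildings": 40_000, "trees": 5_000}
    print("features to display:", check_display_budget(requested, LIMITS))
```

Keeping the limit in one place makes it straightforward to raise it step by step as streaming and level-of-detail improvements land.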
Testing is done to ensure the quality of the software, but also to find areas of improvement. Innovation often sparks from discovering something new — a new use case or a new requirement. I believe that automatically generated and potentially overfitting datasets do not leave a lot of room for improvement. Realistic data on the other hand can show you new ways to help solve problems that your users are dealing with.
✔️ Why is it important to use Open Data for this stage in the software development process? To find the limits of the software, to expose it to diverse data, to ensure quality, and to spark innovation.
✔️ And what type of Open Data is needed for this stage? Large and rich datasets, complex geometries, the same type of data in different locations, and sometimes even “unpolished” data can be helpful.
Final Thoughts
Of course, there are many more examples of how Open Data is used for and in software, above all applications that make direct use of Open Data for analysis and/or context. Apart from the above-mentioned examples, we are also always looking for openly available datasets that can be used for machine learning, e.g., training algorithms on various test datasets. So far, this approach has not been as fruitful as we had hoped as there often isn’t enough data available (yet?) to build reliable systems. Nevertheless, we already extensively use Open Data for samples, prototypes, showcases, and, whenever possible, to improve our software development processes as described above.
I strongly believe that software development can benefit greatly from Open Data, and vice versa. When thinking about the use cases of openly available data, software development is maybe not the first to pop up in people’s minds. However, research has recently picked up this use case as well (see quote & article below) and analyzed it in order to improve the formats and accessibility of Open Data, specifically taking software development into account.
The way open data resources of varied type and volume are used by software applications remains only partly known. —”Towards Increased Understanding of Open Data Use for Software Development” by Maciej Grzenda & Jaroslaw Legierski, in Information Systems Frontiers (2019)
Therefore, I hope that I was able to explain here how we are using Open Data, and to show that this is a relationship worth exploring. I also hope that more and more Open Data will become available in the future and that software development can continuously be improved by making use of it. Thanks for reading along 😍
—
I am looking forward to hearing your thoughts and experiences about the use of Open Data in software development 🙌 comment below or reach out to me if you’d like to continue the conversation!