Leveraging billions of data points to build the future of hospitality
This is part one of a three-part blog series on how we utilize data at Domio. We’ll describe how we built a proprietary data pipeline that Domio leverages to make pricing decisions, discover profitable real estate opportunities, and delight our guests.
At Domio we’ve built a robust data ecosystem that enables our teams to make data-driven decisions in real time, from what markets we enter to how we price our units.
We continuously aggregate data from public sources, analytics streams like Google Analytics, internal Domio inventory and transactional data, and partnerships with external vendors. Weaving these data points together, we end up with a rich understanding of short-term rental markets, comparables, occupancy, and inventory health across the globe.
Collecting, processing, and serving data of this magnitude is no simple task, as our internal data users expect standardized and consistent data to work with. The raw content enters our pipeline in a variety of forms and formats, including HTML, JSON, and CSV. We also have to be prepared for a number of ingestion scenarios: we could be processing millions of small fetched JSON payloads, massive CSVs, or real-time streams. Regardless, we make the magic happen and turn all the information into a standardized format within our central data warehouse, BigQuery.
In the next section, we'll describe the tools and processes we've employed to make this transformation happen. We'll walk through the diagram below and step through our process flow.
A bird’s eye view of Domio’s data ecosystem.
The tools we use
We built our ETL platform on top of a hive of Python Celery workers that buzz away 24/7 processing tasks registered on RabbitMQ. Because we have to process the many different types of data streams we previously discussed in parallel with differing levels of priority, it was important for us to split the workload across a sea of workers and queues segmented between mission-critical and lower priority jobs. This modular design also allows us to separate our ETL pipelines from the rest of the transactional jobs powering our other product streams.
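As a minimal sketch of what that queue segmentation looks like in Celery (module, queue, and task names here are hypothetical, not our actual configuration):

```python
# celery_app.py -- a minimal sketch of queue segmentation (names are hypothetical)
from celery import Celery
from kombu import Queue

app = Celery("domio_etl", broker="amqp://guest:guest@rabbitmq:5672//")

# Dedicated queues so mission-critical work never waits behind bulk backfills.
app.conf.task_queues = (
    Queue("critical"),   # e.g. real-time webhook payloads
    Queue("default"),    # recurring API / CSV extraction
    Queue("low"),        # backfills and other lower-priority jobs
)

# Route tasks to queues by task-name pattern.
app.conf.task_routes = {
    "etl.extract.webhook_payload": {"queue": "critical"},
    "etl.extract.*": {"queue": "default"},
    "etl.backfill.*": {"queue": "low"},
}
```

Each worker pool can then be pointed at its own queue, e.g. `celery -A celery_app worker -Q critical` for the mission-critical jobs.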
This setup is a great fit for us because the tools are well integrated in the Python ecosystem, which is our main backend language. Celery also offers Flower for monitoring, Celery Beat for scheduling recurring tasks, and a built-in retry mechanism for handling failures.
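Here is a rough sketch of how the scheduling and retry pieces fit together; the task name, endpoint, and intervals are illustrative only, and the `app` object is the one from the sketch above:

```python
# tasks.py -- illustrative recurring extraction task with Celery Beat and retries
import requests
from celery.schedules import crontab
from celery_app import app  # the app object from the sketch above

# Celery Beat kicks this task off at the top of every hour.
app.conf.beat_schedule = {
    "fetch-partner-listings-hourly": {
        "task": "etl.extract.fetch_partner_listings",
        "schedule": crontab(minute=0),
    },
}

@app.task(name="etl.extract.fetch_partner_listings", bind=True,
          max_retries=5, default_retry_delay=60)
def fetch_partner_listings(self, endpoint="https://api.example.com/listings"):
    try:
        response = requests.get(endpoint, timeout=30)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as exc:
        # Requeue the task after default_retry_delay seconds, up to max_retries times.
        raise self.retry(exc=exc)
```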
Finally, we decided to build this system on Google’s Kubernetes Engine, which enables us to deploy frequent feature releases, test new ideas, and easily scale our ETL processes. We are not currently using an additional task orchestration layer like Airflow, but such a tool might make sense when we have more complex task dependency issues.
Now that we’ve outlined the tools we are using, let’s walk through the steps of how our ETL pipelines work.
The ETL process
Step 1: Extracting millions of data points
There are two ways for raw data to enter our pipeline. The first type of information is pulled on a recurring schedule from known sources. For example, data from our partnerships with external providers of short-term rental information or real estate inventory is fetched via API endpoints and CSVs dumped to a cloud bucket on a recurring basis. The second type is real-time data, which enters our system through a multitude of webhooks.
Due to the different priority levels and quantity of content we receive, we optimize the extraction jobs by routing them to different message queues on RabbitMQ. Our swarm of Celery workers are responsible for processing tasks off their dedicated queues.
Depending on the payload size and format, we either immediately save it in an intermediary storage layer, Google Cloud Storage (GCS), or chunk it into smaller, more manageable pieces before storing them in GCS. This way, we always have the raw data on hand to fall back on if the later transformation step fails. As we mentioned earlier, Celery has a built-in retry mechanism that allows us to intelligently retry failed tasks. In these cases, depending on which step of the extraction process failed, we either requeue certain tasks or refetch the raw content.
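As a rough sketch of that intermediate step (the bucket name, chunk size, and object paths are hypothetical):

```python
# store_raw.py -- hypothetical sketch of chunking a large payload into GCS objects
import json
from google.cloud import storage

RAW_BUCKET = "domio-raw-data"   # hypothetical bucket name
CHUNK_SIZE = 10_000             # rows per object, tuned per data source

def store_raw_chunks(records, source: str, run_date: str):
    """Split a large extracted payload into manageable GCS objects.

    Keeping the raw chunks around gives the transformation step
    something to fall back on if it fails and has to be retried.
    """
    client = storage.Client()
    bucket = client.bucket(RAW_BUCKET)
    for i in range(0, len(records), CHUNK_SIZE):
        chunk = records[i:i + CHUNK_SIZE]
        blob = bucket.blob(f"raw/{source}/{run_date}/part-{i // CHUNK_SIZE:05d}.json")
        blob.upload_from_string(json.dumps(chunk), content_type="application/json")
```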
A glimpse of our Celery workers in action through Celery Flower’s interface. On this day so far 31 tasks failed, 63 were retried, and approximately 1.3 million were successfully processed.
Our error handling system, which we will discuss in a future article, also alerts us when a task keeps failing after several attempts. At that point an engineer needs to intervene. Luckily for us, our extraction process has been hardened over months of continuous iteration, and engineers are rarely required to step in.
Step 2: Transforming a sea of files
The culmination of our sequence of extraction jobs for a given data source is raw but organized payloads sitting in our GCS buckets. At this point, we begin queueing up our transformation tasks. This is especially challenging, as each payload type from each data source needs its own in-code interface describing how to extract the information we want and reshape it into a tabular structure that our downstream load processes can handle.
Unprocessed files are chunked by day and stored within Google Cloud Storage, our intermediary storage layer.
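Each source gets its own transformer behind a common interface. A simplified sketch, where the class names and field mapping are hypothetical:

```python
# transformers.py -- simplified sketch of per-source transformation interfaces
from abc import ABC, abstractmethod

class BaseTransformer(ABC):
    """Turns one raw payload into tabular rows ready for the load step."""

    @abstractmethod
    def transform(self, raw: dict) -> list[dict]:
        ...

class PartnerListingTransformer(BaseTransformer):
    """Hypothetical mapping for one short-term rental listing feed."""

    def transform(self, raw: dict) -> list[dict]:
        return [
            {
                "listing_id": item["id"],
                "bedrooms": int(item.get("bedrooms") or 0),
                "nightly_rate": float(item.get("price") or 0.0),
                "fetched_at": raw["fetched_at"],
            }
            for item in raw.get("listings", [])
        ]
```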
Although we could immediately push the clean content to BigQuery, we prefer to once again store the processed data in our GCS layer. We do this for a couple of reasons. First, we'd like to batch upload the processed data into BigQuery once it's ready, which yields significant cost savings. Second, we've found it to be great practice to build defensive design into our ETL pipeline by regularly storing data savepoints, since a failed job or unexpected system downtime could otherwise lead to lost data.
Step 3: Loading beautiful, clean data
Finally, dedicated Celery workers take the processed data from GCS and batch upload them into BigQuery tables. Depending on the nature of the data, we either continuously append it to a standard table or create a new partition on a partitioned table. We are now very close to having clean and consistent data for our users to work with.
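A sketch of what that load step can look like with BigQuery's Python client; the dataset, table, and GCS URI below are hypothetical:

```python
# load.py -- hypothetical sketch of batch-loading processed GCS files into BigQuery
from google.cloud import bigquery

def load_processed_files(uri="gs://domio-processed-data/listings/2019-10-01/*.json",
                         table_id="domio-warehouse.raw.partner_listings"):
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        # New rows land in the partition for their fetch date.
        time_partitioning=bigquery.TimePartitioning(field="fetched_at"),
        autodetect=True,
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # block until the batch load completes
```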
The final part of this process is checking data quality, cleaning up duplicate records, and providing users with dedicated entry points suited to specific analytical purposes. We prevent our users from directly accessing the tables we loaded and only expose these “clean views”. Thanks to this, we can abstract away the underlying complexity and structure of our raw tables and provide our internal teams with unchanging sources of truth.
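For illustration, a deduplicating clean view could be defined along these lines, created here through the Python client; the project, dataset, and column names are placeholders:

```python
# clean_views.py -- illustrative creation of a deduplicating "clean view"
from google.cloud import bigquery

client = bigquery.Client()
view = bigquery.Table("domio-warehouse.analytics.clean_partner_listings")
view.view_query = """
    SELECT * EXCEPT (row_num)
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY listing_id
                                ORDER BY fetched_at DESC) AS row_num
      FROM `domio-warehouse.raw.partner_listings`
    )
    WHERE row_num = 1  -- keep only the latest record per listing
"""
client.create_table(view)  # analysts query the view, never the raw table
```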
BigQuery interface showing a few of our scheduled queries leading to standardized clean views.
BigQuery views are logical views, not materialized views. Left as-is, the engine has to run the underlying query that defines the view every time the view is queried. For use cases that require “real-time” data, that behavior is exactly what we want. However, where the underlying data does not change very frequently, we can optimize our client queries using scheduled queries, a beta feature within BigQuery that runs a query on a timer to refresh a destination table, effectively giving us materialized clean views.
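Here is a rough sketch of the kind of refresh query a scheduled query runs, expressed through the Python client with a destination table; all names are hypothetical, and in practice the schedule itself is configured through BigQuery's scheduled queries feature:

```python
# refresh_view.py -- sketch of materializing a clean view into a destination table
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    destination="domio-warehouse.analytics.market_summary",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # full refresh
)
sql = """
    SELECT market, COUNT(*) AS active_listings, AVG(nightly_rate) AS avg_rate
    FROM `domio-warehouse.analytics.clean_partner_listings`
    GROUP BY market
"""
client.query(sql, job_config=job_config).result()
```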
The end result: sanitized, standardized, and consistent data that can be used across a wide range of analysis and visualization tools, with as little latency as possible.
Step 4: Our business intelligence & analysis tools
Our BI tools sit atop the clean views we previously built. They give Domio employees the tools they need to draw insights from a range of previously disconnected sources. More analytical users write raw SQL against the views we provide, while a visual SQL builder interface allows anyone without a technical background to easily interact with the data and leverage it to make business decisions.
Additionally, within our Domio Admin Console, we provide access to more advanced proprietary analytics tools and the ability to create bespoke reports.
Top: Correlation between different attributes of short-term rental listings.
Bottom: Correlation between ADR (Y-axis) and a target listing attribute (X-axis, redacted) for different bedroom counts across a series of months.
Pictured above are two visualizations built to answer a question for our revenue management team. They were interested in learning how different attributes of short-term rental listings affect a listing's revenue. Thanks to the hard work of our engineering team, we have a large enough sample of listings across the U.S. to draw incredibly interesting insights for a question like this.
In this example, we chose Jupyter Notebook as the visualization medium because it is reproducible, interactive, easily shareable, and tightly integrated with our essential data tools. Our use of Jupyter Notebook could be a post in and of itself! For now, we'll say it's an incredible tool and that we plan on building the infrastructure required to host a JupyterHub environment. This would give users a more seamless and interactive experience without having to worry about the underlying setup dependencies.
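For a flavor of what that looks like, a notebook cell along these lines pulls a sample from a clean view and inspects correlations; the query and column names are placeholders rather than our actual analysis:

```python
# Notebook cell -- pull a listings sample from BigQuery and inspect correlations
from google.cloud import bigquery

client = bigquery.Client()
df = client.query("""
    SELECT bedrooms, nightly_rate, occupancy_rate   -- placeholder columns
    FROM `domio-warehouse.analytics.clean_partner_listings`
    LIMIT 100000
""").to_dataframe()

# Pairwise correlation between listing attributes and revenue-related metrics.
df.corr()
```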
Looking into the future
In this article, we covered how we make data-driven decisions possible at Domio. We aggregate millions of data points across a variety of sources and formats and, as if by magic, deliver clean, queryable views for Domio teams to draw insights from.
As a data scientist at Domio, I have enjoyed every single day here because of the wide array of challenges I get to tackle and the support I get from my amazing team. I get to own our data solutions from end-to-end ranging from building out our ETL pipelines, readying the data, creating various data products that teams at Domio depend upon, and promoting a data-driven culture within the whole company. Most importantly, I am able to accomplish all of this while enjoying the journey (yes it is one of our core values)!
If you found the type of engineering problems we’re tackling at Domio interesting, we are hiring!
Harry – work hard, play hard!
In part two of this three-part data series: We will take a deeper dive into how Domio utilizes this aggregated data to answer complex questions and make business decisions. We’ll explore this topic through an array of geospatial charts, heat maps, and other beautiful visual formats to understand supply and demand. Stay tuned.
A big shoutout to all those who helped me with this blog post: Duke Kisch and Richard Pavis for the slick data flow diagram; Alex Grabowski, Carlos Gil, Umer Usman, and Yasmin Kothari for awesome editing advice.
Harry is a data scientist at Domio. When he’s not wearing the data wizard hat, he can be found chilling with his golden retriever or woodworking.