Data Engineering user story and elaboration

This document is to be shared with the candidate either just before or during their technical test (check with your recruitment partner if you're not sure).

User Story

As a data scientist, I want to be able to consume a data source that contains information about how many times each of our customers buys our products in a given period, so that I can predict what they will buy next.

Elaboration

The task involves developing a data pipeline to complete the user story above using sample data sources that will be provided.

Our data science team has asked our data engineering team to pre-process some of their data at scale so that they can make better use of it in their downstream algorithms. They would like us to deliver this data weekly. The input data sources are customers (in CSV format), transactions (in JSON Lines format) and products (in CSV format). Their details are presented below:

Customers

Customers is a table that contains information about customers, such as the customer ID and their loyalty score:

customer_id  loyalty_score
C1           7
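
As a rough illustration of how the customers file might be read, the sketch below loads it into a lookup from customer ID to loyalty score using only the standard library. It assumes the CSV is comma-delimited with a header row matching the column names shown above; the function name `load_loyalty_scores` is hypothetical and not part of the starter project.

```python
import csv
import io
from typing import Dict, TextIO


def load_loyalty_scores(handle: TextIO) -> Dict[str, int]:
    """Map each customer_id to its loyalty_score.

    Assumes a comma-delimited file with a header row (an assumption,
    not something the brief states explicitly).
    """
    reader = csv.DictReader(handle)
    return {row["customer_id"]: int(row["loyalty_score"]) for row in reader}


# In-memory stand-in for the real customers file.
sample_csv = io.StringIO("customer_id,loyalty_score\nC1,7\n")
scores = load_loyalty_scores(sample_csv)
print(scores)
```

The same pattern would likely work for the products file, keyed on the product ID instead.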

Transactions

Transactions is an ever-increasing data source that currently contains two years of transactions. Each transaction contains the customer ID, details of what products they purchased and the date of purchase:

{
    "customer_id": "C1",
    "basket": [
        {
            "product_id": "P3",
            "price": 506
        },
        {
            "product_id": "P4",
            "price": 121
        }
    ],
    "date_of_purchase": "2018-09-01 11:09:00"
}
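
One possible way to work with this structure is to flatten each transaction into one (customer ID, product ID) pair per basket item. The sketch below assumes each line of the input holds exactly one JSON object, as JSON Lines implies, and that every basket entry counts as one purchase; `iter_purchases` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_purchases(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield one (customer_id, product_id) pair per basket item."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for item in record.get("basket", []):
            yield record["customer_id"], item["product_id"]


# The sample transaction from above, written as a single JSON Lines record.
sample = [
    '{"customer_id": "C1", "basket": [{"product_id": "P3", "price": 506}, '
    '{"product_id": "P4", "price": 121}], '
    '"date_of_purchase": "2018-09-01 11:09:00"}'
]
purchases = list(iter_purchases(sample))
print(purchases)  # [('C1', 'P3'), ('C1', 'P4')]
```

Because the function is a generator, it can consume a file handle line by line without loading the whole transaction history into memory.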

Products

Products is a table that contains information about products, such as the product ID, product description and category:

product_id  product_description  product_category
P100        red                  trousers

Acceptance Criteria

The output data source should contain, for every customer, the following fields:

customer_id  loyalty_score  product_id  product_category  purchase_count
C1           7              P2          F                 11
C1           7              P3          H                 5
C2           4              P9          H                 7
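
To make the expected output shape concrete, here is a minimal sketch of the aggregation step. It is not a reference solution. It assumes purchases have already been flattened into (customer ID, product ID) pairs and that customers and products are available as plain dictionaries. The name `build_output` and the sample values are hypothetical, and the sketch leaves out filtering by period and customers with no purchases.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def build_output(
    purchases: Iterable[Tuple[str, str]],
    loyalty_scores: Dict[str, int],
    product_categories: Dict[str, str],
) -> List[dict]:
    """Count purchases per (customer, product) and attach lookup fields."""
    counts = Counter(purchases)
    rows = []
    for (customer_id, product_id), count in sorted(counts.items()):
        rows.append(
            {
                "customer_id": customer_id,
                # .get() leaves None where a lookup has no matching entry.
                "loyalty_score": loyalty_scores.get(customer_id),
                "product_id": product_id,
                "product_category": product_categories.get(product_id),
                "purchase_count": count,
            }
        )
    return rows


# Illustrative inputs only; the counts are not taken from the real data.
rows = build_output(
    purchases=[("C1", "P2"), ("C1", "P2"), ("C1", "P3"), ("C2", "P9")],
    loyalty_scores={"C1": 7, "C2": 4},
    product_categories={"P2": "F", "P3": "H", "P9": "H"},
)
for row in rows:
    print(row)
```

Keeping the aggregation separate from file reading makes it straightforward to unit test, and to swap in a distributed engine later if the data outgrows a single machine.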

Further implementation details

The repository contains a starter project that includes the input data sources, a virtual environment with some dependencies you may find useful, and some basic tests to ensure the environment is ready (but only for Python).

It is recommended that candidates bring their own laptop/IDE, and they should download the code and get it running on their machine ahead of the session to avoid losing time on the day. The code and design should meet the above requirements, and should consider future extension or maintenance by different members of the team.