
Understanding the user journey is essential for any tech-enabled online platform.
We created an online platform where users can invest in mutual funds. On our platform, users can explore funds from various Asset Management Companies (AMCs) and choose the ones that suit them best. The process includes online KYC verification and finalizing transactions to invest in the selected funds. This journey is long and complex, leading to potential drop-offs at various stages. To improve the user experience, the analytics team must map this journey and identify friction points or bugs.
Here’s how we built our analytics pipeline to track the entire journey, identify drop-offs, and prepare daily business reports.
Events Table
The Events Table is the foundation for understanding the user journey. Every click, scroll, and action on the platform is logged, along with data from third-party webhooks. Here’s the table’s structure:
| AnonymousID | CustomerID | EventName | EventMeta | Timestamp |
|---|---|---|---|---|
- AnonymousID is a temporary identifier assigned to users when they first visit the platform. If they log in, the CustomerID is also populated.
- EventName indicates the specific action, such as a page visit, a button click, or a transaction confirmation.
- EventMeta holds additional event data in JSON format, which can include various parameters like transaction amounts, page paths, or other details.
Example of EventMeta:
{"amount": "1000.00", "AMC": "Quant Mutual Fund", "transaction_charges": "2.00", "mode_of_payment": "UPI"}
Events Enriched Table
While the Events Table provides raw data, it’s challenging to extract insights directly from it. We created an enriched version with additional transformations to simplify analysis.
Creating a Flattened Table
To reduce complexity, we extracted key information from EventMeta into separate columns, creating a “wide but slim” table, also known as a Flattened Table. The flat structure lets analysts write simple SQL queries with a WHERE column_name = expected_value clause, instead of nested queries to extract JSON values.
Here’s a Python snippet illustrating the flattening process with the Polars library:
import polars as pl

variables_to_flatten = ['amount', 'AMC', 'transaction_charges']
dtype = pl.Struct([pl.Field(variable, pl.String) for variable in variables_to_flatten])
events_enriched = (
    events
    .with_columns(decoded=pl.col("EventMeta").str.json_decode(dtype))
    .unnest('decoded')
)
Stitching Missing Values
Due to technical issues, some rows in the Events Table may have missing data. Here’s an example:
| AnonymousID | CustomerID |
|---|---|
| AA1 | C1 |
| AA1 | C1 |
| AA1 | |
| AA1 | |
| AA1 | C1 |
In the above case, CustomerID is missing for some rows. Since AnonymousID AA1 corresponds to CustomerID C1, we can stitch the missing values by filling them from adjacent rows.
cols_to_stitch = ['CustomerID']
events_enriched = events_enriched.with_columns(
    pl.col(x).forward_fill().backward_fill().over('AnonymousID') for x in cols_to_stitch
)
The result after stitching:
| AnonymousID | CustomerID |
|---|---|
| AA1 | C1 |
| AA1 | C1 |
| AA1 | C1 |
| AA1 | C1 |
| AA1 | C1 |
Persisting Values
In the journey, users make choices that can impact the subsequent steps. To capture these choices, we persist certain values for the remainder of the journey.
Here’s an example showing the impact of value persistence:
| CustomerID | EventName | MutualFundSelected |
|---|---|---|
| C1 | Landed | |
| C1 | MFSelected | Quant Active Fund |
| C1 | InvestmentAmount | |
| C1 | KYCStarted | |
| C1 | TransactionCompleted | |
Even though MutualFundSelected is recorded only once, it persists throughout the journey until it changes. To persist such values, we use forward-filling:
cols_to_persist = ['MutualFundSelected']
events_enriched = events_enriched.with_columns(
    pl.col(x).forward_fill().over('AnonymousID') for x in cols_to_persist
)
Result after persisting:
| CustomerID | EventName | MutualFundSelected |
|---|---|---|
| C1 | Landed | |
| C1 | MFSelected | Quant Active Fund |
| C1 | InvestmentAmount | Quant Active Fund |
| C1 | KYCStarted | Quant Active Fund |
| C1 | TransactionCompleted | Quant Active Fund |
Difference between Stitching and Persisting
Stitching and persisting both fill missing values, but they differ in direction: stitching fills both backward and forward, while persisting fills only forward.
- Stitching: This technique is used when data might be missing due to technical glitches. It involves both forward-filling and backward-filling to ensure continuity of information, since the missing values can be confidently derived from adjacent rows.
- Persisting: This technique is applied when certain values are expected to remain the same throughout the journey until they change. It uses only forward-filling because it’s uncertain what the value was before its first occurrence.
Extracting Key Events
To track business KPIs, we focus on three critical events:
- Users landing on the invest section.
- KYC initiation.
- Transaction completion.
By creating separate columns for these events, we can easily monitor them:
special_events = {'Landed': 'Landed', 'KYC': 'KYCStarted', 'Transacted': 'TransactionCompleted'}
for k, v in special_events.items():
    events_enriched = events_enriched.with_columns(
        pl.when(pl.col("EventName") == v)
        .then(pl.lit(1))
        .otherwise(pl.lit(0))
        .alias(k)
    )
    events_enriched = events_enriched.with_columns(
        pl.when(pl.col(k) == 1)
        .then(pl.col('Timestamp'))
        .otherwise(None)
        .alias(f'{k}_Timestamp')
    )
Resulting table after extracting key events:
| CustomerID | EventName | Timestamp | Landed | Landed_Timestamp | KYC | KYC_Timestamp | Transacted | Transacted_Timestamp |
|---|---|---|---|---|---|---|---|---|
| C1 | Landed | 1:00PM | 1 | 1:00PM | 0 | None | 0 | None |
| C1 | MFSelected | 1:25PM | 0 | None | 0 | None | 0 | None |
| C1 | InvestmentAmount | 1:26PM | 0 | None | 0 | None | 0 | None |
| C1 | KYCStarted | 1:28PM | 0 | None | 1 | 1:28PM | 0 | None |
| C1 | TransactionCompleted | 1:50PM | 0 | None | 0 | None | 1 | 1:50PM |
Conclusion
The Events Enriched Table is the first step in building an analytics pipeline that helps map the user journey and identify drop-offs. This enriched table simplifies the analysis process by extracting and stitching critical information, persisting key values, and separating notable events for easy tracking. Although this table provides a wealth of data, advanced SQL skills and considerable compute power are required to extract insights.
In the next step of the analytics pipeline, I’ll discuss the Applications Table, where each row represents a single user application to invest in mutual funds, with all related information stored in a wider column format. This simplification allows for more streamlined analysis and easier business tracking.
Disclaimer: The above scenario is simplified and fictitious but closely resembles real-world scenarios often encountered when mapping customer journeys in online investment platforms.
