
Understanding the user journey is essential for any tech-enabled online platform.
We created an online platform where users can invest in mutual funds. On our platform, users can explore funds from various Asset Management Companies (AMCs) and choose the ones that suit them best. The process includes online KYC verification and finalizing transactions to invest in the selected funds. This journey is long and complex, leading to potential drop-offs at various stages. To improve the user experience, the analytics team must map this journey and identify friction points or bugs.
Here’s how we built our analytics pipeline to track the entire journey, identify drop-offs, and prepare daily business reports.
Events Table
The Events Table is the foundation for understanding the user journey. Every click, scroll, and action on the platform is logged, along with data from third-party webhooks. Here’s the table’s structure:
| AnonymousID | CustomerID | EventName | EventMeta | Timestamp |
|---|---|---|---|---|
- AnonymousID is a temporary identifier assigned to users when they first visit the platform. If they log in, the CustomerID is also populated.
- EventName indicates the specific action, such as a page visit, a button click, or a transaction confirmation.
- EventMeta holds additional event data in JSON format, which can include various parameters like transaction amounts, page paths, or other details.
Example of EventMeta:
{"amount": "1000.00", "AMC": "Quant Mutual Fund", "transaction_charges": "2.00", "mode_of_payment": "UPI"}
Events Enriched Table
While the Events Table provides raw data, it’s challenging to extract insights directly from it. We created an enriched version with additional transformations to simplify analysis.
Creating a Flattened Table
To reduce complexity, we extracted key information from EventMeta into separate columns, creating a “wide but slim” table, also known as a Flattened Table. The flat structure lets analysts write simple SQL queries with a WHERE column_name = expected_value clause, instead of nested queries to extract JSON values.
Here’s a Python snippet illustrating the flattening process with the Polars library:
import polars as pl

variables_to_flatten = ['amount', 'AMC', 'transaction_charges']
dtype = pl.Struct([pl.Field(variable, pl.String) for variable in variables_to_flatten])
events_enriched = (
    events
    .with_columns(decoded=pl.col("EventMeta").str.json_decode(dtype))
    .unnest('decoded')
)
Stitching Missing Values
Due to technical issues, some rows in the Events Table may have missing data. Here’s an example:
| AnonymousID | CustomerID |
|---|---|
| AA1 | C1 |
| AA1 | C1 |
| AA1 | |
| AA1 | |
| AA1 | C1 |
In the above case, CustomerID is missing for some rows. Since AnonymousID AA1 corresponds to CustomerID C1, we can stitch the missing values by filling them from adjacent rows.
cols_to_stitch = ['CustomerID']
events_enriched = events_enriched.with_columns(
    pl.col(x).forward_fill().backward_fill().over('AnonymousID') for x in cols_to_stitch
)
The result after stitching:
| AnonymousID | CustomerID |
|---|---|
| AA1 | C1 |
| AA1 | C1 |
| AA1 | C1 |
| AA1 | C1 |
| AA1 | C1 |
Persisting Values
In the journey, users make choices that can impact the subsequent steps. To capture these choices, we persist certain values for the remainder of the journey.
Here’s an example showing the impact of value persistence:
| CustomerID | EventName | MutualFundSelected |
|---|---|---|
| C1 | Landed | |
| C1 | MFSelected | Quant Active Fund |
| C1 | InvestmentAmount | |
| C1 | KYCStarted | |
| C1 | TransactionCompleted | |
Even though MutualFundSelected is recorded only once, it persists throughout the journey until it changes. To persist such values, we use forward-filling:
cols_to_persist = ['MutualFundSelected']
events_enriched = events_enriched.with_columns(
    pl.col(x).forward_fill().over('AnonymousID') for x in cols_to_persist
)
Result after persisting:
| CustomerID | EventName | MutualFundSelected |
|---|---|---|
| C1 | Landed | |
| C1 | MFSelected | Quant Active Fund |
| C1 | InvestmentAmount | Quant Active Fund |
| C1 | KYCStarted | Quant Active Fund |
| C1 | TransactionCompleted | Quant Active Fund |
Difference between Stitching and Persisting
Stitching and persisting both fill missing values, but they differ in direction: stitching fills both backward and forward, while persisting fills only forward.
- Stitching: This technique is used when data might be missing due to technical glitches. It involves both forward-filling and backward-filling to ensure continuity of information, since the missing values can be confidently derived from adjacent rows.
- Persisting: This technique is applied when certain values are expected to remain the same throughout the journey until they change. It uses only forward-filling because it’s uncertain what the value was before its first occurrence.
Extracting Key Events
To track business KPIs, we focus on three critical events:
- Users landing on the invest section.
- KYC initiation.
- Transaction completion.
By creating separate columns for these events, we can easily monitor them:
special_events = {'Landed': 'Landed', 'KYC': 'KYCStarted', 'Transacted': 'TransactionCompleted'}
for k, v in special_events.items():
    events_enriched = events_enriched.with_columns(
        pl.when(pl.col("EventName") == v)
        .then(pl.lit(1))
        .otherwise(pl.lit(0))
        .alias(k)
    )
    events_enriched = events_enriched.with_columns(
        pl.when(pl.col(k) == 1)
        .then(pl.col('Timestamp'))
        .otherwise(None)
        .alias(f'{k}_Timestamp')
    )
Resulting table after extracting key events:
| CustomerID | EventName | Timestamp | Landed | Landed_Timestamp | KYC | KYC_Timestamp | Transacted | Transacted_Timestamp |
|---|---|---|---|---|---|---|---|---|
| C1 | Landed | 1:00PM | 1 | 1:00PM | 0 | None | 0 | None |
| C1 | MFSelected | 1:25PM | 0 | None | 0 | None | 0 | None |
| C1 | InvestmentAmount | 1:26PM | 0 | None | 0 | None | 0 | None |
| C1 | KYCStarted | 1:28PM | 0 | None | 1 | 1:28PM | 0 | None |
| C1 | TransactionCompleted | 1:50PM | 0 | None | 0 | None | 1 | 1:50PM |
Conclusion
The Events Enriched Table is the first step in building an analytics pipeline that helps map the user journey and identify drop-offs. This enriched table simplifies the analysis process by extracting and stitching critical information, persisting key values, and separating notable events for easy tracking. Although this table provides a wealth of data, advanced SQL skills and considerable compute power are required to extract insights.
In the next step of the analytics pipeline, I’ll discuss the Applications Table, where each row represents a single user application to invest in mutual funds, with all related information stored in a wider column format. This simplification allows for more streamlined analysis and easier business tracking.
Disclaimer: The above scenario is simplified and fictitious but closely resembles real-world scenarios often encountered when mapping customer journeys in online investment platforms.
