
Pancake Documentation

Understanding Challenges with Semi-Structured Data

Broadly speaking, Pancake is designed to convert semi-structured hierarchical data into relational data by deeply and recursively scanning it and generating the SQL statements to create flattened Dynamic Tables.

The following sections give a brief overview of the problems Pancake is designed to solve, along with enough background to understand each problem in more depth.

Here are some of the problems that Pancake directly addresses:

  • Polymorphic Data

  • Schema Discovery

  • Parsing & Querying Semi-Structured Data

  • Schema Evolution/Drift
     

Polymorphic Data

Polymorphic data refers to attributes in a semi-structured document that have different structures or data types. This can occur when documents receive raw data from multiple sources, or when changes are made to the underlying schema. It can easily result in queries that miss large portions of raw data, or in damage to downstream features and analytics.

The following are examples of polymorphic attributes found in a single JSON document. They are intended to provide a basic understanding of polymorphism in data but are by no means exhaustive:

address (string)
property_type (string):


{
    "_id": "re-8594abcd-3217-4889-ef01-23b45cd6789f",
    "property_id": "PR-5689",
    "property_type": "Mixed-use",
    "address": "123 Main St"
}

address (list) object array
property_type (object):


{
    "_id": "re-8594efgh-3217-4889-ef01-23b45cd6789f",
    "property_id": "PR-5690",
    "property_type": {"usage":"Commercial", "sq_ft":20000},
    "address": [{
        "street": "1234 High Tower Rd",
        "city": "Skyline",
        "state": "NY",
        "zip_code": "10101"
    }]
}

 

amenities (list) containing both primitives and objects:


    "amenities": [
        "Gym",
        "Rooftop Deck",
        {
            "name": "Conference Rooms",
            "availability": "Reservable",
            "capacity": 20
        }
    ],

 

Issues like the examples above are very common with a flexible format like JSON. Pancake is designed to help you avoid the poor data quality that can result from that flexibility.
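Snowflake's TYPEOF function offers one quick way to see whether an attribute has become polymorphic. A minimal sketch, assuming the documents above are stored in a VARIANT column named raw in a hypothetical table called properties:

```sql
-- Count how many distinct shapes the address attribute takes.
-- TYPEOF returns VARCHAR, OBJECT, ARRAY, etc. for each row.
SELECT TYPEOF(raw:address) AS address_type,
       COUNT(*)            AS doc_count
FROM properties
GROUP BY address_type;
```

A result with more than one row (e.g. both VARCHAR and ARRAY) signals that queries written against a single shape will silently miss documents.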

 
Schema Discovery

When parsing JSON data, it is frequently difficult to know with certainty that the entire schema has been fully documented. Schemaless structures can change at any time, and hierarchical structures like JSON have no limit to the depth of nested objects and arrays. The same things which make JSON a desirable storage format also create challenges for organizations hoping to leverage that data for applications and analytics.
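Snowflake's FLATTEN table function can be run with RECURSIVE => TRUE to walk every nested level of a document, which illustrates the kind of deep traversal schema discovery requires. A sketch, again assuming a hypothetical properties table with a VARIANT column raw:

```sql
-- List every path that appears anywhere in the documents,
-- along with the data type observed at that path.
SELECT DISTINCT f.path,
       TYPEOF(f.value) AS data_type
FROM properties p,
     LATERAL FLATTEN(input => p.raw, RECURSIVE => TRUE) f
ORDER BY f.path;
```

Even this simple query hints at the problem: the output only reflects the rows scanned, so the schema is never known to be complete until every document has been examined.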


Parsing and Querying Semi-Structured Data

Snowflake SQL provides excellent support for parsing and processing data, but complex JSON structures require equally complex queries if users want to unpack, unnest, flatten, and relate the data found in them. That level of data and SQL complexity can be difficult to test and troubleshoot. If an attribute path is not notated precisely, the query returns a null value, and you can easily exclude data from your result set unless you know the underlying data well enough to recognize that the value should not be null.
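For example, relating the nested address array back to its parent document takes a LATERAL FLATTEN and precise path notation for every attribute. A sketch against the same hypothetical properties table used above:

```sql
-- One output row per address element. A typo in any path
-- (e.g. a.value:Street) silently yields NULL, not an error.
SELECT p.raw:property_id::STRING AS property_id,
       a.value:street::STRING    AS street,
       a.value:city::STRING      AS city,
       a.value:zip_code::STRING  AS zip_code
FROM properties p,
     LATERAL FLATTEN(input => p.raw:address) a;
```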

Schema Evolution (or Schema Drift)

The structure or schema of JSON can change easily, and often accidentally, making it difficult to track and process changes to upstream data sources. Schema drift can create sudden problems as downstream pipelines are impacted or even broken, delaying reporting, analytics, and even product rollouts when features depend on data from the JSON source. Applications built on document databases that use Snowflake as their analytics platform and repository suffer from these uncommunicated and often overlooked schema changes. Worse, if the schema has not changed but data has inadvertently become polymorphic due to data migration errors, these adverse impacts can occur without something as obvious as a broken dashboard to call attention to the issue.

 

Application Overview

Warehouse

In Pancake, warehouses simply refer to the specific virtual warehouses in your Snowflake account that you make available to the application as compute resources. As part of the application setup, Pancake is granted privileges which allow you to maintain control over the creation and management of your virtual warehouses via the application.

 

This object is what allows Pancake to use the warehouses you make available to the app for scanning. Our recommendation is that you set up a batch of warehouses used only by Pancake to handle concurrent scans and to cover a range of document sizes and complexity.

 

We advise that you follow a naming convention of pancake_[warehouse size]_[number] (e.g. pancake_medium_snowpark_optimized_3) for ease of selection and use within the app or in a worksheet.
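Following that convention, a dedicated warehouse for Pancake might be created as follows (the name and sizing here are illustrative, not required by the app):

```sql
-- A dedicated Snowpark-optimized warehouse reserved for Pancake scans.
CREATE WAREHOUSE IF NOT EXISTS pancake_medium_snowpark_optimized_3
  WAREHOUSE_SIZE = 'MEDIUM'
  WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
  AUTO_SUSPEND = 60          -- suspend quickly when no scan is running
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;
```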

Data Source

Pancake currently supports data sources consisting of valid JSON stored in a VARIANT column, one object per row. A data source can be added from a table, external table, or view. The JSON object can contain any number of attributes of any type, nested to any depth. Pancake’s support for external tables allows users to scan JSON data currently held in cloud storage such as an S3 bucket or Azure Blob Storage without needing to move that data into Snowflake.

Data sources must be valid JSON to be added to Pancake. Documents containing invalid JSON will produce an error and not be added to the application.
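You can check a candidate source for invalid documents before adding it. One way is Snowflake's CHECK_JSON function, sketched here against a hypothetical staging table with a text column named payload:

```sql
-- CHECK_JSON returns NULL for valid JSON and an error
-- description for invalid JSON.
SELECT payload,
       CHECK_JSON(payload) AS json_error
FROM raw_staging
WHERE CHECK_JSON(payload) IS NOT NULL;
```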

Scan Configuration

In Pancake, a Scan Configuration is a stored set of parameters for scanning a specific data source. Users may wish to configure quick scans when onboarding data sources, full scans at regular intervals if there is a chance of schema changes across the whole document, or configure a scan with a where clause to only scan for changes since the most recent scan. Scan configurations are flexible, and designed to give users granular control over how compute resources are utilized within the app.

NOTE: Pancake makes very efficient use of Snowflake’s compute resources and, depending on document complexity, can scan approximately 1,000,000 documents per minute on a Medium Snowpark-optimized warehouse. However, Snowflake will automatically terminate a procedure after 3600 seconds, so users must right-size their Scan Configurations.

Scan and Scan History

Typically users will configure scans with a cron schedule, but scans can also be initiated manually by choosing a data source and scan configuration and pressing the Start Scan button on the Scan Data Source page. Selecting a data source and scan configuration on that page also reveals the 10 most recent scans for that combination. The complete scan history for all Scan Configurations is available from the Pancake view VW_DATASOURCE_SCANS, found in the View menu of the application database within PANCAKE.CORE.
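The full history can be queried like any other view. For example, to see the most recent scans across all configurations (the ordering column here is illustrative; check the view definition in your account):

```sql
-- Inspect recent scan activity outside the app UI.
SELECT *
FROM PANCAKE.CORE.VW_DATASOURCE_SCANS
ORDER BY scan_start_time DESC
LIMIT 50;
```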

Attribute Metadata (Discovered)

Data sources that have been upgraded to the Schema Analysis tier will have access to attribute metadata generated from Pancake’s scan. The Schema Analysis provides details of the following:

  • attribute path

  • attribute name

  • polymorphic name

  • attribute level

  • attribute order

  • data type

  • sample value

  • array type (object or primitive)

  • primitive array yes/no

  • primitive array data type

  • date of first scan

 

This data is also available outside the app UI via the following two views:

  • VW_DATASOURCE_ATTRIBUTES

  • VW_DATASOURCE_VERSION_ATTRIBUTES (shows attributes by version or scan)

Dynamic Table Generation Metadata (Coming Soon)

Once a data source has been scanned at the Schema Analysis tier, users can use the Dynamic Table Generation page to select that data source and configure the additional properties for each attribute that are required to generate the SQL statements to extract and flatten that JSON source into Dynamic Tables.

Users will be asked to validate, specify, and/or configure the following:

  • Snowflake data type

  • Precision/Scale

  • Datetime format

 

These have default values inferred from the scan, but users may want to adjust them based on their understanding of the data. Users will also select and configure the attributes that will be carried down to each nested array. This allows the Dynamic Tables to be related as relational streams.

Ex: You are dealing with a JSON document containing retail sales information, with an object that contains an orders array. You can select the customer_id column from the root level to include in the flattened-out orders array. This will allow you to join the two Dynamic Tables based on the customer_id.
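Once generated, the two Dynamic Tables can be joined like any relational tables. The table and column names below are illustrative of what generated tables might look like, not actual names produced by the app:

```sql
-- Join the root-level table to the flattened orders table
-- using the carried-down customer_id attribute.
SELECT c.customer_id,
       o.order_id,
       o.order_total
FROM dt_sales_root c
JOIN dt_sales_orders o
  ON o.customer_id = c.customer_id;
```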

Users can access the generated code from the Dynamic Table Generation page as well as [VIEW].
 
