Introduction to the Data Manager: How to set up a Dataflow

Mathias Kolind -

The Data Manager is a powerful self-service tool for ingesting, mapping, and cleansing dataflows going into the Customer Data Platform (CDP). You will find the Data Manager in the Raptor Control Panel under Tools.

The Data Manager provides a step-by-step widget for setting up and scheduling the import of dataflows from a variety of sources into the CDP. Think of the Data Manager as a window into the engine that ties your data to the Raptor system and structures it so it is ready to be activated, for instance in the Audience Builder. We have taken a complex procedure and made it manageable for people without coding skills.

That said, you need to know your way around your own data, and you should have a good understanding of how and where the data is ultimately used. Luckily, your dataflow setup is validated throughout the setup guide, and you can monitor errors with Raptor's Operational Insights service once the flows are up and running.

We are adding features and improving the Data Manager at a rapid pace. Please don’t hold back input or ideas if you find things that can be improved or if you are missing functionality or features.

This guide will get you started and explain the individual steps when setting up a dataflow. Also see separate articles on Schemas & Sources and The Dataflow Overview for further introduction to Data Manager functionality. 

 

Table of Contents

  1. Creating a Dataflow
    1. General Information
    2. Download
    3. File Mapping
    4. Transform & Map
      1. Overwrite or Append?
      2. Transform Data
      3. Map Columns to Schema
    5. Activate

1. Creating a Dataflow

Selecting the ‘Create new dataflow’ button at the top of the page or at the center of the Welcome-screen, or selecting an existing or draft-state dataflow, will take you into the Dataflow Creator. This consists of five stages which must be navigated in order. Once a stage has been completed, you may freely navigate to it when needed. Each step provides a checklist at the top, showing which mandatory fields have yet to be filled, as well as the option to save your work-in-progress as a draft.

1.1 General information   

Billede1.png

 

In this stage, you will need to fill in the most basic information about your new Dataflow. The Name (mandatory) and Description (optional) are purely for your own use. Source and Destination are key to the functionality of the Dataflow. The selected Source generally denotes where you are pulling the data from, be it website tracking, physical stores (POS), email marketing systems etc., and plays a key role in identifying the finished dataflow in the Audience Builder and other Raptor applications. Destination is the Raptor component that will put the processed data to use – such as the Customer Data Platform for person data, website tracking data or POS data, or Merchandising for product feeds.

1.2 Download

2022-04-29_15-03-01.png

In this stage, you will select how and from where the data for the dataflow is downloaded by specifying a download protocol and providing information needed to access the data files.  

The following protocols are available for download: 

FTP
Download files from FTP by providing the following information:

  • FTP host
  • File path where the file can be found
  • Username of the FTP
  • Password of the FTP

HubSpot
From HubSpot you can download persons with attached properties from your HubSpot account. You need to be a "super admin" of your HubSpot account to fetch the required information. The following information is required:

  • API Key
    • Go to your HubSpot account
    • Click the settings icon in the top right corner
    • In the left sidebar menu, navigate to Integrations > API key  
      • If a key has never been generated for your account, click Generate API key
      • If you've already generated an API key, click Show to display your key
        show-API-key.png
    • Copy the key and paste it into the API Key field in the Data Manager.
  • Properties
    • Click the settings icon in the top right corner of your HubSpot account
    • In the left sidebar menu, navigate to Data Management > Properties
    • Find the properties you would like to include and click their names
    • In the right-hand side panel, press the </> icon next to the property name
    • Now copy the internal name and paste it into the property field in the Data Manager
    • Repeat for each property, and separate the properties by commas
    • Note: The Data Manager will automatically fetch Email, First name and Last name of your HubSpot contacts. If you only need these three properties of your contacts, you can leave the property field in the Data Manager blank
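
To illustrate the Properties field, here is a minimal Python sketch that assembles the comma-separated value from a list of HubSpot internal names. The helper name `build_property_field` is hypothetical. Dropping the three automatically fetched properties, the internal names used for them, and joining without spaces are all assumptions rather than documented requirements of the Data Manager.

```python
# Internal names of the properties the Data Manager fetches automatically
# (Email, First name, Last name). These internal names are an assumption.
AUTO_FETCHED = {"email", "firstname", "lastname"}


def build_property_field(internal_names):
    """Return a comma-separated string of HubSpot internal property names.

    Blank entries, duplicates and the automatically fetched properties
    are dropped; the original order is kept.
    """
    seen = set()
    kept = []
    for name in internal_names:
        cleaned = name.strip()
        if not cleaned or cleaned in AUTO_FETCHED or cleaned in seen:
            continue
        seen.add(cleaned)
        kept.append(cleaned)
    return ",".join(kept)


# Example property names are illustrative; copy the real internal names
# from the </> panel in your own HubSpot account.
print(build_property_field(["email", "lifecyclestage", " company ", "lifecyclestage"]))
```

If only Email, First name and Last name are needed, the helper returns an empty string, which matches leaving the property field blank.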

SFTP
Download files from SFTP by providing the following information:

  • SFTP host
  • Port
  • File path where the file can be found
  • Username of the SFTP
  • Password of the SFTP

Date Settings

When selecting a file path for the SFTP, FTP or HTTP protocol, you can add a set of curly brackets at the end to set the format of the file's datestamp and, if necessary, a date offset.

Example: {Today:yyyyMMdd}

Example with offset: {Today-1:ddMMyyyy}

The offset may be useful if the data transfer happens at or shortly after midnight, for example.
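
To preview which file name a date placeholder such as {Today-1:ddMMyyyy} points to, the expansion can be sketched in Python as below. This is an approximation for checking your own paths, not the Data Manager's actual implementation. The function name `expand_date_placeholder`, the example path, the reading of `yyyy`, `MM` and `dd` as year, month and day, and the support for a `+` offset are assumptions.

```python
import re
from datetime import date, timedelta

# Assumed meaning of the format tokens seen in the examples.
_TOKENS = {"yyyy": "%Y", "MM": "%m", "dd": "%d"}

# Matches {Today:FORMAT} and {Today-1:FORMAT}; a "+" offset is an assumption.
_PLACEHOLDER = re.compile(r"\{Today(?P<offset>[+-]\d+)?:(?P<fmt>[^}]+)\}")


def expand_date_placeholder(path, today):
    """Replace a {Today[±N]:format} placeholder in `path` with a formatted date."""

    def _replace(match):
        offset_days = int(match.group("offset") or 0)
        fmt = match.group("fmt")
        for token, directive in _TOKENS.items():
            fmt = fmt.replace(token, directive)
        return (today + timedelta(days=offset_days)).strftime(fmt)

    return _PLACEHOLDER.sub(_replace, path)


# Hypothetical file path, evaluated for a run on 1 May 2022.
run_date = date(2022, 5, 1)
print(expand_date_placeholder("/export/orders_{Today:yyyyMMdd}.csv", run_date))
print(expand_date_placeholder("/export/orders_{Today-1:ddMMyyyy}.csv", run_date))
```

The second call shows why the offset matters around midnight: a run on 1 May looks for the file stamped 30 April.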

StreamFile  

The StreamFile option binds the dataflow to the Raptor Streaming API (see the Streaming API documentation for the endpoint). Rather than requiring an input, it auto-generates a StreamID when selected. This ID is then supplied in the Raptor Streaming API header, allowing you to establish a datastream from the Streaming API into the Data Manager.

Website tracking

You can connect your website tracking performed by the Raptor tracking script to the Data Manager and stream data into the CDP by providing the following information:

  • Account ID of your Raptor Account (four or five digits found in your Raptor Account)
  • Start year (YYYY), start month (MM) and start day (DD), which together indicate the point in time from which you would like to ingest data from your website tracking

HTTP
Download files from an online location over HTTP(S) by providing the following information:

  • URL of the file 

Regardless of the options chosen, the Data Manager will validate the supplied information before allowing you to proceed. Assuming all data has been entered correctly, the download will proceed, which may take some time depending on the amount of data. When complete, you will be given the option to download and view a local copy of the data – which may prove necessary for troubleshooting purposes. The Data Manager also identifies the format of the supplied file, with three options currently being supported: CSV, JSON and Raw.

1.3 File Mapping

Billede12.png

This stage will generally be completed automatically – once the system detects the format of the supplied file, it will fill in all the necessary information itself and show the parsed columns on the right. Here, you can see the available columns, view examples of their datapoints and change their names.

The ‘Mandatory’ checkbox can be used to disallow empty fields in a specific column – ensuring that an error will be reported if any empty values appear there in future runs. If the current file contains empty values for a given column this option cannot be set.
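
Before ticking 'Mandatory', it can help to check a local copy of the file for empty values. The following Python sketch lists the columns of a CSV file that contain no empty values and could therefore be flagged. The function name and sample data are hypothetical, and treating whitespace-only cells as empty is an assumption about how the Data Manager counts them.

```python
import csv
import io


def columns_without_empty_values(csv_text, delimiter=","):
    """Return the column names that have a non-empty value in every row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    columns = list(reader.fieldnames or [])
    has_empty = {column: False for column in columns}
    for row in reader:
        for column in columns:
            value = row.get(column)
            # Missing cells (short rows) and whitespace-only cells count as empty.
            if value is None or value.strip() == "":
                has_empty[column] = True
    return [column for column in columns if not has_empty[column]]


# Hypothetical sample: the phone column is empty in the first data row.
sample = (
    "email,first_name,phone\n"
    "ann@example.com,Ann,\n"
    "bob@example.com,Bob,555-0100\n"
)
print(columns_without_empty_values(sample))
```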

If the file is in a nonstandard format or the system otherwise has issues reading it, you will need to manually edit the parsing-options or select the appropriate format yourself. After making any necessary alterations, select the ‘Parse File’ button to try again. If successful, the columns should appear on the right as usual. Should problems persist at this stage, make sure you have pointed at the right file in the previous stage.

 

1.4 Transform & Map

Billede13.png

In this stage, you transform data and map it to a schema. On the left are the columns you derived from the supplied datafile in the previous step. On the right, you can select one or more schemas which the values of your original data can then be mapped onto. It is also possible to create a new schema to make it tailor-fit for your data. Go to the Schemas & Sources article to see how to create a new schema.

More commonly, you will simply select one of the default or previously-customized schemas available. A specific name can be set for this particular application of the given schema. Once selected, the schema will be displayed on the right, showing the columns it contains and the format they require. Essentially, this provides an easily-accessible blueprint for a dataflow that can be readily understood and employed by Raptor Service’s systems.

Billede14.png

Each line of the schema can be mapped onto a column from the data-source on the left. An ‘auto-map’ option is also available, which will search the file for matching names and formats for each line – though a manual check to make sure everything is pointing where it should afterwards is strongly advised. Note however that all columns are, initially, formatted as simple strings. In order to match up with lines that require a particular format – such as conversion-ready currency-values, e-mail lists, decimals and integers that can be plugged into equations or alike – columns must first be Transformed into the appropriate format.
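
The idea behind auto-mapping can be pictured with the Python sketch below, which pairs schema lines with columns when both the normalized name and the format match. The real matching rules of the Data Manager are not documented here, so the normalization, the type names and the function `auto_map` are all assumptions made for illustration.

```python
def auto_map(schema_lines, columns):
    """Map schema line names to column names on matching name and format.

    Both arguments are lists of (name, format) tuples. Names are compared
    case-insensitively with punctuation and spaces removed (an assumption).
    """

    def normalize(name):
        return "".join(ch for ch in name.lower() if ch.isalnum())

    available = {(normalize(name), fmt): name for name, fmt in columns}
    mapping = {}
    for line_name, line_fmt in schema_lines:
        match = available.get((normalize(line_name), line_fmt))
        if match is not None:
            mapping[line_name] = match
    return mapping


# Hypothetical schema and columns. "age" is still a plain string column,
# so it cannot match the integer schema line until it has been converted.
schema = [("Email", "email"), ("FirstName", "string"), ("Age", "integer")]
file_columns = [("email", "email"), ("first_name", "string"), ("age", "string")]
print(auto_map(schema, file_columns))
```

The unmatched 'Age' line shows why untransformed string columns are skipped, and why a manual check after auto-mapping is worthwhile.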

Aggregation Mode

When selecting a Schema, you can elect to activate Aggregation Mode for it via a simple checkbox, enabling some more complex mapping-options meant for collating information according to various shared details. For more information about this feature, see the Aggregation Mode help-file.

 

1.4.1 Overwrite or Append?

When selecting a Schema of the types PersonData or Catalog, you will be required to select the Write Mode that will be applied to it - Overwrite or Append.

OverwriteAppend.png

Append is the default - with each scheduled run, new data overwrites old data, but any other existing entries are left untouched. This effectively means that 'legacy data' is left untouched - if a product is removed from a catalog, or a customer from a mailing-list, the entry will remain in the dataset, unchanged and untouched by repeated runs of the Dataflow. This can be convenient in some cases, allowing the CDP to take such old data into its calculations, and create continuity in the data if the product or person is later added again... or it might just be useless, irrelevant junk data clogging up the system.

This, then, is when Overwrite may be preferable - if this is selected, the original dataset is completely removed and replaced with the new one each time the Dataflow runs. This prevents outdated lines from accumulating, and is thus the optimal solution for any dataset that is likely to undergo frequent changes - if lines are added and removed with each run, 'leftovers' can accumulate quickly and slow down calculations. Without the Overwrite feature, this would require an onerous and labor-intensive manual cleanup.

The question of whether to use Append or Overwrite thus effectively comes down to two factors: Firstly, is it likely that lines that have been removed from the dataset might stay relevant or become useful again in the future? Second, how often will lines be removed from the dataset - that is, how much 'old data' is it going to generate? If the answer is 'Possible' and 'Not much', you'll want to stick with Append. If it's 'Highly unlikely' and 'Loads', you definitely want Overwrite. Beyond those two edge-cases, though, it's somewhat of a judgement-call...
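
The difference between the two write modes can be sketched in Python with datasets represented as dictionaries keyed by an ID. This is a conceptual model only: the function names are hypothetical, and the assumption that Append replaces a matching entry as a whole (rather than merging individual fields) is not confirmed by this article.

```python
def append_run(existing, incoming):
    """Append: incoming entries replace matching ones, all others are kept."""
    merged = dict(existing)
    merged.update(incoming)
    return merged


def overwrite_run(existing, incoming):
    """Overwrite: the previous dataset is discarded and replaced entirely."""
    return dict(incoming)


# Hypothetical catalog: p2 has been removed from the new file, p3 is new.
catalog = {"p1": {"name": "Lamp", "price": 100}, "p2": {"name": "Chair", "price": 250}}
new_file = {"p1": {"name": "Lamp", "price": 120}, "p3": {"name": "Desk", "price": 900}}

print(sorted(append_run(catalog, new_file)))     # p2 remains as legacy data
print(sorted(overwrite_run(catalog, new_file)))  # p2 is gone
```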

 

1.4.2 Transform data

Transform data is a flexible feature that can be used to reformat a column, combine multiple columns into a single combined expression, replace certain expressions, or all of the above at the same time. Whatever shape the final data needs to be in, the Transform-feature will get it there. Simply select the ‘+ New transformation’ button at the bottom of the column-listing to perform one.

First, select the relevant column. ONLY select multiple columns if you intend to combine them. A default name will be generated once you’ve made your selection, but it is recommended that you change this to a more appropriate name if you intend to use the auto-map option. Then click 'Next step'.

In the next screen, a number of different methods can be used to transform the column’s values, ranging from basic formatting or search-and-replace to integrated Regex-commands.

2022-02-25_09-30-59.png

A drop-down in the upper right corner lets you view a selection of samples from the relevant column, helping to give you an idea of what needs doing. If multiple columns were chosen, each can be individually transformed.

 

A note on E-Mail Conversions
When converting a column into the 'Email' format, you will be given the option to clear out invalid emails. This can prevent mistyped emails from forcing the removal of an entire line, so long as some other, valid Person ID is present. However, it may be advisable to first use the Transform-feature, utilizing the 'Replace' transformation to fix some of the more common errors - for example, replacing '.vom' and '.con' with '.com' might salvage several emails that would otherwise have been removed altogether.
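
The effect of such a 'Replace' transformation can be tried out on a sample of your own data with a short Python sketch like the one below. The function names and the simple validity check are assumptions and do not reproduce the Data Manager's own email validation. The replacement is applied only at the end of the address, which is a deliberate choice in this sketch to avoid altering domains that merely contain '.con'.

```python
import re

# Common ending typos and their corrections, taken from the example above.
ENDING_FIXES = {".vom": ".com", ".con": ".com"}

# Deliberately simple shape check: something@something.something
_SIMPLE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def fix_common_typos(email):
    """Correct a mistyped ending such as '.vom' or '.con' to '.com'."""
    cleaned = email.strip()
    for wrong, right in ENDING_FIXES.items():
        if cleaned.lower().endswith(wrong):
            return cleaned[: -len(wrong)] + right
    return cleaned


def looks_valid(email):
    """Rough validity check used only for this illustration."""
    return _SIMPLE_EMAIL.match(email) is not None


for raw in ["ann@example.vom", "bob@example.con ", "eve@mail.contoso.com", "not-an-email"]:
    fixed = fix_common_typos(raw)
    print(raw.strip(), "->", fixed, "| valid:", looks_valid(fixed))
```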

 

After selecting the desired transformation(s), hitting the ‘Transform’ button will provide you with a provisional output, showing the current state of the dataset. The value used for this conversion is drawn from the chosen column and shown at the top, where you can also choose to edit or replace it if you want to see how the transformations you are applying affect a specific value. Note that once you move past this step, the example-value can no longer be changed.

Applying a transformation is not mandatory – if the data is already in an appropriate shape, you can simply select ‘Skip transformation’ to move on to the Converter.

However, if you selected two or more columns you will first have to select what order they are joined in as well as the methodology by which they are combined. Each option requires one or more inputs, each carefully described for ease of use – be sure to consider which will best serve the purpose of the combination. If you selected only one column, this view will be automatically skipped, sending you directly to the Converter.

In the Converter, you finally select the type of column you want to output in order to match the requirements of the schema. Depending on your choice, various specifics will need to be filled in. While this step cannot be skipped, the ‘String’ option will effectively maintain the same typing that the column started out with, for cases where you simply needed to transform or combine a column. When the appropriate selection has been made and specifics input, hit ‘Convert’ to finalize it. An example will be shown of the final value. At this point, the results can be saved as a custom column with the designated button.

2022-02-25_09-51-55.png

If a column only needs to be converted, an easy shortcut is to mouse over the desired column on the overview, causing a ‘Convert’ option to appear on the right. This will take you directly to the converter.

2022-02-24_14-57-56.png

The results of your transformations will appear at the bottom of the column-list, under the header ‘Custom Columns’. Each custom column is now ready to be mapped to a schema-line of the appropriate format.

 

1.4.3 Map Columns to Schema 

To map a column to your schema, click the blue '+' symbol to the left of the schema line (the + will be transformed to the word "MAP" when hovering over it). 

Billede20.png

This will open a box listing all columns with the same data type as the selected schema line. Select the right column (in this case, there is only one column of the type 'email') and press 'Save'.

2022-02-25_10-12-09.png

The result of the mapping can be observed in the box to the right.

2022-02-25_10-17-07.png

In order to finish the mapping, all lines marked as ‘Required’ in the selected schema must be mapped to a column – all the rest are technically optional, but it is generally better to line up as much information as possible. In the 'Downloaded columns' section, the columns that have been matched to the schema will be highlighted with a solid frame, making it easier to spot which remain unmapped. A large number of unmapped lines may suggest that a different schema could work better, or that some customization of the relevant schema is needed. The ‘Quick-check’ option can also be used at this stage to spot-check for bad matchups, with particular attention paid to any custom columns – potentially saving time, compared to discovering such problems when the dataflow is in production.
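
The rule that every 'Required' schema line must be mapped can be expressed as a small check, sketched here in Python. The data structures and the function name `unmapped_required_lines` are hypothetical and only model the rule described above.

```python
def unmapped_required_lines(schema_lines, mapping):
    """Return the names of required schema lines that have no mapped column.

    `schema_lines` is a list of dicts with 'name' and 'required' keys;
    `mapping` maps schema line names to column names.
    """
    return [
        line["name"]
        for line in schema_lines
        if line["required"] and not mapping.get(line["name"])
    ]


# Hypothetical schema and mapping: CustomerId is required but not yet mapped.
schema_lines = [
    {"name": "Email", "required": True},
    {"name": "FirstName", "required": False},
    {"name": "CustomerId", "required": True},
]
current_mapping = {"Email": "email_clean"}

missing = unmapped_required_lines(schema_lines, current_mapping)
print("Mapping complete" if not missing else f"Still unmapped: {missing}")
```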

1.5 Activate

In this final stage, the scheduling of the dataflow is set. You should consider the nature of your data: is it important to keep it up to date with hourly or otherwise regularly-scheduled runs, or is it enough to run it once per day - or even only this one time altogether?

Once the dataflow is scheduled, all that remains is to run a full test using the ‘Test and Activate’ button. The ‘Ignore column conversion errors’ checkbox can be selected if you want a higher tolerance for mismatches (the system will then accept conversion errors in up to 40 % of the total number of rows).

If there are any issues they will appear here, and likely require you to go back through the previous steps to correct them.

Billede21.png

If the issue lies in the original source file, you will be informed of the exact rows that the problem arises from. Such errors will generally require you to correct them right at the source. Other, lesser problems can often be traced down to the previous step. 
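
The tolerance applied by the ‘Ignore column conversion errors’ checkbox can be pictured with the following Python sketch, which counts rows that fail a decimal conversion and compares their share with the 40 % limit. The function names are hypothetical. Treating the limit as inclusive, and treating any conversion error as a failed test when the checkbox is not selected, are assumptions.

```python
from decimal import Decimal, InvalidOperation

MAX_ERROR_SHARE = 0.40  # "up to 40 %" of all rows; inclusive limit is assumed


def conversion_report(values, convert=Decimal):
    """Return (failed_row_numbers, error_share) for converting `values`."""
    failed_rows = []
    for row_number, value in enumerate(values, start=1):
        try:
            convert(value)
        except (InvalidOperation, ValueError, TypeError):
            failed_rows.append(row_number)
    share = len(failed_rows) / len(values) if values else 0.0
    return failed_rows, share


def passes_test(values, ignore_conversion_errors):
    """Model of the test outcome for a single converted column."""
    failed_rows, share = conversion_report(values)
    if not failed_rows:
        return True
    return ignore_conversion_errors and share <= MAX_ERROR_SHARE


# Hypothetical price column: rows 2 and 4 cannot be converted to decimals.
prices = ["10.5", "abc", "7", "", "3.2"]
failed, share = conversion_report(prices)
print(f"Failed rows: {failed}, share: {share:.0%}")
print("Passes with checkbox:", passes_test(prices, ignore_conversion_errors=True))
print("Passes without checkbox:", passes_test(prices, ignore_conversion_errors=False))
```

Listing the failing row numbers mirrors how the test reports the exact rows in the source file that need to be corrected.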

Assuming the test is passed, you will be able to select ‘Activate in production’ in order to put the dataflow into use in e.g. the CDP. Alternatively, 'Run now and activate in production' can be selected from the same pop-up to run the new dataflow immediately, after which it will fall back on the set schedule.

Related articles: 

Schemas & Sources

The Dataflow Overview
