Posts filed under ‘General Analytics’

Getting Value from Your Data Series: It’s All About Your Data Ecosystem


By Marilyn Craig and Mary Ludloff

We’re back with the fourth post in our series on how to get value from your data, including how to ensure that new “data” and “analytics” products are designed for successful delivery to new and existing customers.

In the previous posts in this series, we discussed our methodology and what is required in terms of understanding your target customer—who they are and what they need—as well as making sure you have the right Team in place to work on the project. In this post, we are going to discuss how you build your Data Ecosystem:

  • What is needed to ensure that data processes will support the new product(s)?
  • How do you identify appropriate data partners and enhancements?
  • What privacy- and security-related issues must you be aware of and address?

(more…)

October 12, 2015 at 8:41 am 1 comment

Getting Value From Your Data Series: It’s All About the Team

By Marilyn Craig & Mary Ludloff

Welcome back to the third post in our series on how to get value from your data.
As we stated in a previous post:

“Data, without the proper use of analytics, is meaningless. If data is the new oil, think of analytics as the oil drills—you need both to be successful.”

Of course, getting to “success” is not easy, as anyone involved in an analytics project will tell you. This series walks you through our methodology on what it takes—from inception to proof of concept to implementation and deployment—to navigate project pitfalls. However, if you’ve assembled a great team, you will be able to drill for all that oil. In our experience, great teams tend to develop, manage, and sustain successful analytics projects: It all comes down to having the right people with the right skill set. (more…)

March 2, 2015 at 5:21 pm 1 comment

In a pii (Privacy, Identity, Innovation) Conference State of Mind

By Mary Ludloff

Although this year has been extremely busy for us, Terence and I always find time for this event: The Privacy Identity Innovation Conference. Natalie Fonseca, the conference’s co-founder and executive producer, is the driving force behind its ongoing success. This year’s program focuses on:

“… the latest developments in areas like mobile, biometrics, the Internet of Things and big data. Learn about emerging trends and business models driving the personal information economy, and get guidance on developing strategies and best practices to build trust with your users.” (more…)

November 9, 2014 at 1:33 pm Leave a comment

Getting Value From Your Data Series: Who Is Your Target Customer?


By Mary Ludloff and Marilyn Craig

Welcome back to the second post in our series on how to get value from your data. As we stated in a previous post:

“Data, without the proper use of analytics, is meaningless. If data is the new oil, think of analytics as the oil drills—you need both to be successful.”

Of course, getting to “success” is not easy, as anyone involved in an analytics project will tell you. This series walks you through our methodology on what it takes—from inception to proof of concept to implementation and deployment—to navigate project pitfalls. Most of us, at one time or another, have been involved with technically great analytics projects that answered no real need. In this post, we take a look at the customer, their pain points, and the benefits they may derive from your analytics project. In other words: Who is your target customer? (more…)

October 15, 2014 at 4:41 pm 1 comment

Getting Value From Your Data Series: The Road May Be Rocky But It’s Well Worth the Effort!


By Mary Ludloff and Marilyn Craig

Unless you’ve been asleep for the past couple of years, you, like us, have heard this phrase again and again: Data is the new oil. It certainly sounds great, but what exactly does it mean? Here’s our take: Getting the most value out of your data can make you better at what you do as well as enable you to do more with what you have. In other words, there’s unrealized value in those data silos that all companies have. But make no mistake: the road to realizing data value is paved with good intentions and, oftentimes, poor execution and results.

Today, most companies are drowning in data: there’s historical data from operations, data from public sources, data from partners and acquisitions, data you can purchase from data brokers, and so on. These companies have read all the research and want to leverage their data assets to make “better” operational decisions, to offer their existing customer base more insights, and to pursue new revenue opportunities. Of course, the real value in that data is derived from the business analytics that deliver the insights that drive better decisions. As we’ve said quite often on this blog: Data, without the proper use of analytics, is meaningless. If data is the new oil, think of analytics as the oil drills—you need both to be successful. (more…)

September 30, 2014 at 4:06 pm 4 comments

Lessons Learned from the Google Flu Tracker—Why We Need More Than Just Data

By Marilyn Craig and Mary Ludloff

We read an interesting paper and post about Google Flu Trends (GFT) and its foibles last week. The paper points out a couple of lessons that those of us living in the big data analytics world have learned the hard way, but the dangers are worth revisiting as tools like ours (AnalyticsPBI for Azure) begin to move big data analytics into the mainstream of organizational practices. After all, our tool (and others like it) makes it easy, and even fun, for analytics junkies to use all those available zettabytes of data to answer questions that they’ve long wondered about. But the paper also reminded us of the dangers of ignoring the natural cycles of an analytics process, which we talked about in this recent post. If Google had followed the PatternBuilders Analytics Methodology, it might have avoided many of the errors that GFT is now spitting out. In fact, the authors of the paper point out that:

“Although not widely reported until 2013, the new GFT has been persistently overestimating flu prevalence for a much longer time. GFT also missed by a very large margin in the 2011-2012 flu season and has missed high for 100 out of 108 weeks starting with August 2011… This pattern means that GFT overlooks considerable information that could be extracted by traditional statistical methods.”

This overestimation is attributed to two primary factors: data hubris and algorithm dynamics. (more…)

April 9, 2014 at 6:15 pm Leave a comment

Reporting, Analytics, and Big Data: A Continuous Feedback Loop to Drive Better Decision-Making

By Marilyn Craig

A recent conversation with a client reminded me that no matter how crazy and exciting the Big Data world gets, it is still critical to understand what your goals are and where you are in the process of reaching those goals. Having a good foundation in “what’s important” is critical before you jump into the wild world of Big Analytics.

For example, in big data (well, actually all data, but I digress), “Reporting” and “Analytics” are very different functions. But I often find our customers and prospects grappling with how to distinguish one from the other and, as a result, confusing reporting with analysis and losing track of their real goals.

(more…)

March 2, 2014 at 3:07 pm 3 comments

Events to Measures – Scalable Analytics Calculations using PatternBuilders in the Cloud

By Terence Craig

One part of the secret sauce that enables PatternBuilders to provide a more accessible and performant user experience for both creators and consumers of streaming analytics models is our infrastructure. It makes it easy to combine rich search capabilities with a diverse set of standard analytics that can be used to create more complex streaming analytics models. This post describes how we create those standard analytics, which we call Measures.

In my last post about our architecture, we delved into how we used custom SignalReaders as the point of entry for data into AnalyticsPBI. We’ve tightened up our nomenclature a bit since then, so it’s worth reviewing some of our definitions:

Nomenclature

Feed: An external source of data to be analyzed. These can include truly real-time feeds, such as stock tickers or the Twitter firehose, as well as batch feeds, such as CSV files converted to data streams.

Event: An external event within a Feed that analysis will be performed on: a stock tick, an RFID read, a PBI performance event, a tweet, etc. AnalyticsPBI can support analysis on any type of event as long as it has one or more named numeric fields and a date. An Event can have multiple Signals.

Signal: A single numeric data element within an Event, tagged with the metadata that accompanied the Event plus any additional metadata (to use NSA parlance) applied by the FeedReader. For example, a stock tick would have Signals of Price and Volume, among others.

Tag: A string representing a piece of metadata about an Event. Tags are combined to form Indexes for both Events and Measures.

FeedReader (formerly SignalReader): A service written by PatternBuilders, customers, or third parties to read particular Feed(s), convert the metadata to Tags, and potentially add metadata from other sources to create Events. Simple examples include a CSV reader and a stock tick reader. A more complex example is the reader we created for the University of Sydney project, which filters the Twitter firehose for mentions of specific stock symbols and hyperlinks to major media articles, and then creates an Event that includes a Signal derived from the sentiment scores of those linked articles. (That reader was discussed here.) A FeedReader’s primary responsibility is to create and index an object that converts the “raw data” received from one or more Feeds into an Event. To accomplish this, it does the following (see the sketch after this list):

  1. Captures an Event from a feed: a stock ticker, an RFID channel, the Twitter firehose, etc.
  2. Uses the Event itself and any appropriate external data to attach or enrich the Event’s metadata and numeric data.
  3. Creates a MasterIndex from all of the metadata attached to the Event. This MasterIndex, together with the Date associated with the Event, is used to create Measures and Models later in the process. It can also attach geo data if appropriate.
  4. Extracts the numeric Signals for that Event.
  5. Pushes the Event object onto a named queue, the “EventToBeCalculatedQueue”, for processing. This queue, like all PatternBuilders queues, has a pluggable implementation: it can be in-memory (cheaper and faster) or persistent (more costly and slightly slower). One of the great advantages of the various cloud services, including our reference platform Azure, is the availability of scalable, fast, reliable, persistent queues.

Measure: A basic calculation that is generated automatically by the PatternBuilders calculation service and persisted. Measures are useful in and of themselves, but they are also used to dynamically generate results for more complex streaming Analytic Models.
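
To make those five steps concrete, here is a minimal Python sketch of the FeedReader flow. To be clear, the class and function names below are purely illustrative (our actual implementation is neither Python nor this simple):

from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

@dataclass
class Event:
    """Illustrative Event shape; mirrors the JSON example later in this post."""
    feed: str
    event_date: datetime
    tags: List[str]                   # metadata Tags about the Event
    signals: Dict[str, float]         # named numeric fields, e.g. Price, Volume
    master_index: str = ""            # canonical concatenation of all Tags

class StockFeedReader:
    """Sketch of steps 1-4: capture, enrich, build the MasterIndex, extract Signals."""

    def __init__(self, enrichment: Dict[str, List[str]]):
        # Step 2's external data, e.g.:
        # {"ACME": ["AcmeSoftware", "Technology", "FTSE", "Services"]}
        self.enrichment = enrichment

    def read(self, raw_tick: dict) -> Event:
        tags = self.enrichment.get(raw_tick["symbol"], [raw_tick["symbol"]])
        event = Event(
            feed="SampleStockTicker",
            event_date=datetime.fromisoformat(raw_tick["time"]),
            tags=tags,
            signals={"Price": float(raw_tick["price"]),    # step 4
                     "Volume": float(raw_tick["volume"])},
        )
        event.master_index = ":".join(sorted(tags))        # step 3
        return event

def push_event(queue: List[Event], event: Event) -> None:
    """Step 5: push onto the EventToBeCalculatedQueue (a plain list stands in here)."""
    queue.append(event)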

As the topic of this post is Events to Measures, let’s create a simple Measure and follow it through the process. For this purpose, we’ll be working with a simplified StockFeedReader that creates a tick Event from a tick feed that includes two Signals – Volume and Price – for stock symbols on a minute-by-minute basis. The reader will enrich the Feed’s raw tick data with metadata about the company’s industries and locations. After enrichment, the JSON version of the Event would look like this:

{
     "Feed": "SampleStockTicker",
     "FeedGranularity": "Minute",
     "EventDate": "Fri, 23 Aug 2013 09:13:32 GMT",
     "MasterIndex": "AcmeSoftware:FTSE:Services:Technology",
     "Locations":  [
          {
              "Americas Sales Office": {
                  "Lat": "40.65",
                  "Long": "73.94"
               }
           },
          {
               "Europe Sales Office": {
                  "Lat": "51.51",
                  "Long": "0.12"
               }
          }
      ],
      "Tags":  [
          {
              "Tag1": "AcmeSoftware",
              "Tag2": "Technology",
              "Tag3": "FTSE"
          }
       ],
       "Signals":  [
          {
               "Price": "20.00",
               "Volume": "10000"
          }
       ]
}

Note that there is a MasterIndex field that is a concatenation of all the Tags about the tick. When the MasterIndex is persisted, it is actually stored in a more space-efficient format, but we will use the canonical form of the index shown above throughout this post for clarity.

A MasterIndex has two purposes in life:

  1. To allow the user to easily find a Signal by searching for particular Tags.
  2. To act as the seed for creating indexes for Measures and Models. These indexes, along with a date range, are all that is required to find any analytic calculations in the system.
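
As a rough sketch of that first purpose: finding a Signal by Tags boils down to a subset match against the canonical index. (The Python below is purely illustrative; it is not how our index service is actually implemented.)

def matches(master_index: str, search_tags: set) -> bool:
    """A search hits an index when every searched Tag appears in it."""
    return search_tags <= set(master_index.split(":"))

# Both searches below would find the AcmeSoftware tick from the JSON example.
idx = "AcmeSoftware:FTSE:Services:Technology"
print(matches(idx, {"FTSE", "Technology"}))   # True
print(matches(idx, {"NASDAQ"}))               # False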

Once an Event has been created by a FeedReader, the FeedReader uses an API call to place the Event on the EventToBeCalculatedQueue. Based on beta feedback, we’ve adopted a pluggable queuing strategy, so before we go any further, let’s take a quick detour and talk briefly about what that means (a small sketch of the idea follows the list). Currently, PatternBuilders supports three types of queues for Events:

  • A pure in-memory queue. This is ideal for customers who want the highest performance and the lowest cost and who are willing to redo calculations in the unlikely event of machine failure. To keep failure risk as low as possible, we replicate the queues on different machines and, optionally, place those machines in different datacenters.
  • Cloud-based queues. Currently, we use Azure ServiceBus Queues, but there is no reason that we couldn’t support other PaaS vendors’ queues as well. The nice thing about ServiceBus queues is that the latest update from Microsoft for Windows Server 2012 allows them to be used on-premises against Windows Server with the same code as for the cloud—giving our customers maximum deployment flexibility.
  • The AMQP protocol. This allows our customers to host FeedReaders and Event queues completely on-premises while using our calculation engine. When combined with encrypted Tags, this lets our customers keep their secrets “secret” and still enjoy the benefits of a real-time cloud analytics infrastructure.
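
To give a feel for what “pluggable” means here, the sketch below shows a single queue contract with two interchangeable backends. It’s a toy, of course: the names and shapes are ours for illustration only, and the persistent variant merely stands in for a durable service such as a cloud queue.

from abc import ABC, abstractmethod
from collections import deque
from typing import Any, Optional

class EventQueue(ABC):
    """The contract every queue backend must satisfy."""
    @abstractmethod
    def push(self, event: Any) -> None: ...
    @abstractmethod
    def pop(self) -> Optional[Any]: ...

class InMemoryQueue(EventQueue):
    """Cheapest and fastest; contents are lost if the machine fails."""
    def __init__(self) -> None:
        self._q: deque = deque()
    def push(self, event: Any) -> None:
        self._q.append(event)
    def pop(self) -> Optional[Any]:
        return self._q.popleft() if self._q else None

class PersistentQueue(EventQueue):
    """Stand-in for a durable backend: slightly slower, but survives failure."""
    def __init__(self, store: Any) -> None:
        self._store = store            # any durable storage client
    def push(self, event: Any) -> None:
        self._store.append(event)
    def pop(self) -> Optional[Any]:
        return self._store.pop()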

Once the Event is placed on the EventToBeCalculatedQueue, it will be picked up by the first available Indexing server, which monitors that queue for new Events (all queues and Indexing servers can be scaled up or down dynamically). The indexing service is responsible for creating Measure indexes from the Tags associated with the Event. This is the most performance-critical part of loading data, so forgive our skimpiness on implementation details; we are going to let our competition design this one for themselves :-). Let’s just say that, conceptually, the index service creates a text-searchable index for all non-alias Tags and any associated geo data. Some Tags are simply aliases for other Tags and do not need Measures created for them. For example, the symbol AAPL is simply an alternative for Apple Computer, so creating an average volume metric for both AAPL and Apple is pointless since they will always be the same. Being able to find that value by searching on either AAPL or Apple, on the other hand, is amazingly useful and is fully supported by the system.

More formally:

<Geek warning on>

The number of Indexes produced by an Event is the number of non-empty combinations of its non-alias Tags:

C(n,1) + C(n,2) + … + C(n,n) = 2^n − 1

where n equals the number of non-alias Tags and C(n,k) denotes the number of ways to choose k Tags from n.

</Geek warning off>

From our simple example above, we have the following Tags: AcmeSoftware, FTSE, Services, and Technology.  This trivial example will produce the following Indexes:

AcmeSoftware
FTSE
Services
Technology
AcmeSoftware:FTSE
AcmeSoftware:Services
AcmeSoftware:Technology
FTSE:Services
FTSE:Technology
Services:Technology
AcmeSoftware:FTSE:Services
AcmeSoftware:FTSE:Technology
AcmeSoftware:Services:Technology
FTSE:Services:Technology
AcmeSoftware:FTSE:Services:Technology
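
Generating those Indexes is simply a matter of enumerating every non-empty combination of the non-alias Tags. In Python terms, it might look something like this (an illustrative fragment, not our indexing service):

from itertools import combinations

def build_indexes(tags):
    """Yield every non-empty Tag combination in canonical, colon-joined form."""
    tags = sorted(tags)
    for k in range(1, len(tags) + 1):
        for combo in combinations(tags, k):
            yield ":".join(combo)

indexes = list(build_indexes(["AcmeSoftware", "FTSE", "Services", "Technology"]))
print(len(indexes))  # 15, i.e. 2^4 - 1, matching the list above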

The indexing service can perform parallel index creation across multiple cores and/or machines if needed. As Indexes are created, they, along with each Signal in the Event, are combined into a calculation request object and placed in the MeasureCalculationRequestQueue, which is monitored by the Measure Calculation Service.

The Measure Calculation Service takes each Index and uses it to create or update all of the standard Measures (Sum, Count, Average, Standard Deviation, Last, etc.) for each Signal, for each unique combination of Index and the Measure’s native granularity. (Granularity management is complex and will be discussed in my next post.)

Specifically, the Calculation Service will remove a calculation request object from the queue and perform the following steps for all Measures appropriate to the Signal:

  1. Attempt to retrieve the Measure from either cache or persistent storage.
  2. If not found, create the Measure for the appropriate Date and Signal.
  3. Perform the associated calculation and update the Measure.
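
In rough pseudocode terms, that get-or-create-update loop looks something like this (Python with illustrative names; the real service is neither this simple nor written in this language):

from dataclasses import dataclass

@dataclass
class Measure:
    """A toy running-average Measure; Sum, Count, Last, etc. work the same way."""
    count: int = 0
    total: float = 0.0
    last: float = 0.0

    def update(self, value: float) -> None:
        # Streaming update: no need to re-read historical Events.
        self.count += 1
        self.total += value
        self.last = value

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0

# Stand-in for the cache + persistent storage, keyed by (Index, date, Signal).
store: dict = {}

def handle_request(index: str, date: str, signal: str, value: float) -> None:
    key = (index, date, signal)
    measure = store.get(key)            # step 1: retrieve
    if measure is None:                 # step 2: create if not found
        measure = store[key] = Measure()
    measure.update(value)               # step 3: calculate and update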

Graphically, the whole process looks something like this:

[Diagram: FeedReader → EventToBeCalculatedQueue → Indexing Service → MeasureCalculationRequestQueue → Measure Calculation Service → persisted Measures]

The advantages of this approach are manifold. First, it allows for very sophisticated search capabilities across Measures and Models. Second, it allows deep parallelization of Measure calculation. This parallelization lets us scale the system by creating more Indexing Services and Calculation Services with no risk of contention, and it is this scalability that allows us to provide near real-time, streaming updates for all Measures and most Models. Each Index, time, and Measure combination is unique and can be calculated by separate threads or even separate machines. A Measure can be aggregated up from its native granularity using a pyramidal scheme if the user requests it (say, by querying for an annual number from a Measure whose Signal has a native granularity of a minute). A proprietary algorithm prevents double counting in the edge cases where Measures with different Indexes are calculated from the same Events.
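
As a toy illustration of that pyramidal idea: for additive Measures, a coarser granularity is just the combination of the finer-grained Measures beneath it (real granularity handling, as the end of this post notes, is considerably more involved):

def roll_up(finer):
    """Combine finer-grained (count, total) Measures into one coarser Measure.
    Averages are derived from the rolled-up count and total, never averaged
    directly, which is what makes the pyramid safe for additive Measures."""
    count = sum(c for c, _ in finer)
    total = sum(t for _, t in finer)
    return count, total

# Sixty minute-granularity Measures roll up into one hourly Measure.
print(roll_up([(10, 200.0)] * 60))  # (600, 12000.0)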

So now you’ve seen how we get from a raw stream to a Measure, and how, along the way, we enrich metadata and numeric data to enable both richer search capabilities and easier computation of more complex analytics models. Later on, we will explore how searches are performed and models are developed; you will see how this enrichment process makes exploring and creating complex analytics models much easier than in the first generation of big data, business intelligence, or desktop analytics systems.

However, before we get there, we need to talk about how PatternBuilders handles dates and Granularity in more detail. At our core, we are optimized for time-series analytics, and how we deal with time is a critical part of our infrastructure. That is why, in my next post, we will be doing a deep (OK, medium-deep) dive into how we handle pyramidal aggregation and the always-slippery concepts of time and streaming data. Thanks for reading, and as always, comments are free and welcomed!

August 29, 2013 at 8:18 am 2 comments

pii2013: Building Trust in the Data Driven Economy—Hope to see you there!

By Terence Craig

As entrepreneurs at a growing startup, we find very few things exciting enough to divert even a tiny bit of our attention from giving our customers the world’s best streaming analytics technology. And while my co-founder Mary and I have been known to disagree on what those things might be, we are always in agreement that the Privacy Identity Innovation conferences (pii) are the best conferences for bringing together leading voices from technology, science, and government for the critical discussion of what privacy and identity mean in the age of the NSA, Facebook, and the Internet of Things. pii2013 is being held in Seattle this year to (as the conference website states):

“Explore emerging technologies and business models, and highlight strategies and best practices for building trust with users. From news reports of increasing government surveillance to stories about startups using customer data in ‘surprising’ ways, there’s no shortage of examples illustrating why now is an important time to talk about innovation and trust. It’s a critical conversation about the future of privacy, identity and reputation that you won’t want to miss.” (more…)

August 12, 2013 at 2:18 pm Leave a comment

Privacy v Security, Transparency v Secrecy: The NSA, PRISM, and the Release of Classified Documents

By Mary Ludloff

Privacy, Anonymity, and Judicial Oversight are on the Endangered List

An age-old debate has once again reared its very ugly head thanks to whistleblower Edward Snowden’s revelations about NSA surveillance, PRISM, and the astounding lack of any rigorous oversight of the NSA’s vast data collection apparatus. While PatternBuilders has been incredibly busy, in our non-copious amounts of spare time Terence and I have also been working on our update to Privacy and Big Data (which is undergoing another rewrite due to new government surveillance revelations that for a while arrived hourly, then daily, then weekly, but are certainly far from over). It’s important to note that, pre-revelations, our task was already herculean: mainstream media had picked up on “all stories related to privacy” (a good thing) but often missed the mark on the technical side of the house (we often find ourselves explaining to non-techies just what metadata is, usually after someone on CNN, Fox, NBC, ABC, etc., butchers the definition) or got tripped up by the various Acts, Amendments, state laws, EU Directives, etc., that apply to aspects of privacy.

Over the last few weeks, as details about PRISM emerged, it has become clear to me that main street America may still not understand the seismic shift that big data and analytics bring to the privacy debate. Certainly the power of big data and analytics has been lauded or vilified in the press; followers of our Twitter feed are used to seeing the pros and cons of big data projects debated pretty much every day. We (Terence and I) have talked and tweeted about privacy issues as they apply to individuals, companies, and governments. Heck, we even wrote a book about privacy and big data. (more…)

July 19, 2013 at 12:14 pm 3 comments
