Posts tagged ‘Microsoft’
Events to Measures – Scalable Analytics Calculations using PatternBuilders in the Cloud
One part of the secret sauce that enables PatternBuilders to provide more accessible and performant user experiences for both creators and consumers of streaming analytics models is its infrastructure. Our infrastructure makes it easy to combine rich search capabilities for a diverse set of standard analytics that can be used to create more complex streaming analytics models. This post will describe how we create those standard analytics that we call Measures.
In my last post about our architecture, we delved into how we used custom SignalReaders as the point of entry for data into AnalyticsPBI. We’ve tightened up our nomenclature a bit since then, so it’s worth reviewing some of our definitions:
Nomenclature | Description |
Feed | An external source of data to be analyzed. These can include truly real-time feeds such as stock-tickers, the Twitter firehose, or batch feeds, such as CSV files converted to data streams. |
Event | An external event within a Feed that analysis will be performed on. For example, a stock tick, RFID read, PBI performance event, tweet, etc. AnalyticsPBI can support analysis on any type of event as long as it has one or more named numeric fields and a date. An Event can have multiple Signals. |
Signal | A single numeric data element within an Event, tagged with the metadata that accompanied the Event, plus any additional metadata (to use NSA parlance) applied by the FeedReader. For example, a stock tick would have Signals of Price and Volume among others. |
Tag | A string representing a piece of metadata about an Event. Tags are combined to form Indexes for both Events and Measures. |
FeedReader (formerly SignalReader) | A service written by PatternBuilders, customers, or third parties to read particular Feed(s), convert the metadata to Tags, and potentially add metadata from other sources to create Events. Simple examples include a CSV reader and a stock tick reader. An example of a more complex reader is the reader we have created for the University of Sydney project that filters the Twitter firehose for mentions of specific stock symbols and hyperlinks to major media articles and then creates an Event that includes a Signal derived from the sentiment scores of those linked articles. That reader was discussed here. A FeedReader’s primary responsibility is to create and index an object that converts “raw data” received from one or more Feeds to an Event. To accomplish this it does the following: |
Measure | A basic calculation that is generated automatically by the PatternBuilders calculation service and persisted. Measures are useful in and of themselves but they are also used to dynamically generate results for more complex streaming Analytic Models. |
As the topic of this post is Events to Measures, let’s create a simple Measure and follow it through the process. For this purpose, we’ll be working with a simplified StockFeedReader that will create a tick Event from a tick feed that includes two Signals – Volume and Price – for stock symbols on a minute-by-minute basis. The reader will enrich the Feed’s raw tick data with metadata about the company’s industries and locations. After enrichment, the JSON version of the Event would look like this:
{
  "Feed": "SampleStockTicker",
  "FeedGranularity": "Minute",
  "EventDate": "Fri, 23 Aug 2013 09:13:32 GMT",
  "MasterIndex": "AcmeSoftware:FTSE:Services:Technology",
  "Locations": [
    { "Americas Sales Office": { "Lat": "40.65", "Long": "73.94" } },
    { "Europe Sales Office": { "Lat": "51.51", "Long": "0.12" } }
  ],
  "Tags": [
    { "Tag1": "AcmeSoftware", "Tag2": "Technology", "Tag3": "FTSE" }
  ],
  "Signals": [
    { "Price": "20.00", "Volume": "10000" }
  ]
}
Note that there is a MasterIndex field that is a concatenation of all the Tags about the tick. When the MasterIndex is persisted, it is actually stored in a more space efficient format but we will use the canonical form of the index as shown above throughout this post for clarity.
A MasterIndex has two purposes in life:
- To allow the user to easily find a Signal by searching for particular Tags.
- To act as the seed for creating indexes for Measures and Models. These indexes, along with a date range, are all that is required to find any analytic calculations in the system.
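To make the canonical form concrete, here is a minimal Python sketch. It assumes the canonical MasterIndex is simply the Tags, de-duplicated, sorted alphabetically, and joined with colons, which matches the sample Event but is not a description of the actual (more space-efficient) persisted format. The function name `master_index` is hypothetical.

```python
def master_index(tags):
    """Build a canonical index string from a collection of Tags.

    Assumption: canonical form = unique Tags, sorted, joined with ':'.
    This mirrors the sample Event only; the real storage format differs.
    """
    return ":".join(sorted(set(tags)))


if __name__ == "__main__":
    # Tag order in the input does not matter; the result is always the same.
    print(master_index(["AcmeSoftware", "Technology", "FTSE", "Services"]))
```

Sorting matters here because it makes the index independent of the order in which a FeedReader happens to emit its Tags.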
Once an Event has been created by a FeedReader, the FeedReader uses an API call to place the Event on the EventToBeCalculatedQueue. Based on beta feedback, we’ve adopted a pluggable queuing strategy. So before we go any further, let’s take a quick detour and talk briefly about what that means. Currently, PatternBuilders supports three types of queues for Events:
- A pure in-memory queue. This is ideal for customers that want the highest performance and the lowest cost and who are willing to redo calculations in the unlikely event of machine failure. To keep failure risk as low as possible, we actually replicate the queues on different machines and optionally, place those machines in different datacenters.
- Cloud-based queues. Currently, we use Azure ServiceBus Queues but there is no reason that we couldn’t also support other PaaS vendors’ queues. The nice thing about ServiceBus queues is that the latest update from Microsoft for Windows 2012 allows them to be used on-premise against Windows Server with the same code as for the cloud, giving our customers maximum deployment flexibility.
- AMQP protocol. This allows our customers to host FeedReaders and Event queues completely on-premise while using our calculation engine. When combined with encrypted Tags, this allows our customers to keep their secrets “secret” and still enjoy the benefits of a real-time cloud analytics infrastructure.
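A pluggable queuing strategy usually boils down to coding against a small interface and swapping the implementation behind it. The sketch below is illustrative only: `EventQueue`, `InMemoryEventQueue`, and `publish` are hypothetical names, not the PatternBuilders API, and only the in-memory variant is shown. A ServiceBus or AMQP variant would be another subclass wrapping the vendor's client library.

```python
from abc import ABC, abstractmethod
from collections import deque


class EventQueue(ABC):
    """Hypothetical interface that every queue back end would implement."""

    @abstractmethod
    def put(self, event):
        """Add an Event to the queue."""

    @abstractmethod
    def get(self):
        """Return the oldest Event, or None if the queue is empty."""


class InMemoryEventQueue(EventQueue):
    """Simplest back end: a FIFO held in process memory (no replication)."""

    def __init__(self):
        self._items = deque()

    def put(self, event):
        self._items.append(event)

    def get(self):
        return self._items.popleft() if self._items else None


def publish(queue, event):
    """What a FeedReader would call; it never needs to know the back end."""
    queue.put(event)


if __name__ == "__main__":
    q = InMemoryEventQueue()
    publish(q, {"Feed": "SampleStockTicker", "Signals": {"Price": 20.0}})
    print(q.get())
```

Because the FeedReader only sees `EventQueue`, choosing between the three queue types becomes a deployment decision rather than a code change.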
Once the Event is placed on the IndexRequestQueue, it will be picked up by the first available Indexing server which monitors that queue for new Events (all queues and Indexing servers can be scaled up or down dynamically). The indexing service is responsible for creating measure indexes from the Tags associated with the Event. This is the most performance critical part of loading data so forgive our skimpiness on implementation details but we are going to let our competition design this one for themselves :-). Let’s just say that conceptually the index service creates a text-searchable index for all non-alias Tags and any associated geo data. Some Tags are simply aliases for other Tags and do not need Measures created for them. For example, the symbol AAPL is simply an alternative for Apple Computer, so creating an average volume metric for both AAPL and Apple is pointless since they will always be the same. Being able to find that value by searching on AAPL or Apple, on the other hand, is amazingly useful and is fully supported by the system.
More formally:
<Geek warning on>
The number of Indexes produced by an Event will be:
$$\sum_{k=1}^{n} \binom{n}{k} = 2^{n} - 1$$
where n equals the number of non-alias tags and the upper limit for k is equal to n.
</Geek warning off>
From our simple example above, we have the following Tags: AcmeSoftware, FTSE, Services, and Technology. This trivial example will produce the following Indexes:
AcmeSoftware
FTSE
Services
Technology
AcmeSoftware:FTSE
AcmeSoftware:Services
AcmeSoftware:Technology
FTSE:Services
FTSE:Technology
Services:Technology
AcmeSoftware:FTSE:Services
AcmeSoftware:FTSE:Technology
AcmeSoftware:Services:Technology
FTSE:Services:Technology
AcmeSoftware:FTSE:Services:Technology
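The list above is every non-empty combination of the four Tags, taken one through n at a time. As a rough illustration (not the production indexing service, whose implementation is deliberately not described here), the same list can be generated with Python's standard `itertools.combinations`. The function name `measure_indexes` is hypothetical, and the input is assumed to contain only non-alias Tags.

```python
from itertools import combinations


def measure_indexes(tags):
    """Return every non-empty combination of Tags as a ':'-joined index.

    Assumption: `tags` holds non-alias Tags only. Tags are sorted first so
    that each combination comes out in canonical (alphabetical) order.
    """
    ordered = sorted(set(tags))
    return [
        ":".join(combo)
        for k in range(1, len(ordered) + 1)  # k = 1 .. n tags at a time
        for combo in combinations(ordered, k)
    ]


if __name__ == "__main__":
    indexes = measure_indexes(["AcmeSoftware", "FTSE", "Services", "Technology"])
    print(len(indexes))  # 15 indexes for 4 tags
    for index in indexes:
        print(index)
```

Note that the count roughly doubles with every additional Tag, which is one reason skipping alias Tags matters.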
The indexing service can perform parallel index creation across multiple cores and/or machines if needed. As Indexes are created, they, and each Signal in the Event, are combined into a calculation request object and placed on the MeasureCalculationRequestQueue, which is monitored by the Measure Calculation Service.
The analytics service will take each index and use it to create/update all of the standard measures (Sum, Count, Avg, Standard Deviation, Last, etc.) for each unique combination of index and the Measure’s native granularity for each Signal (Granularity management is complex and will be discussed in my next post).
Specifically, the Calculation Service will remove a calculation request object from the queue and perform the following steps for all Measures appropriate to the Signal:
- Attempt to retrieve the Measure from either cache or persistent storage.
- If not found, create the Measure for the appropriate Date and Signal.
- Perform the associated calculation and update the Measure.
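The three steps above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Measure` class, the `process_request` function, the plain dict standing in for cache plus persistent storage, and the choice of Welford's running algorithm with a population standard deviation. None of it is taken from the actual Calculation Service.

```python
import math
from dataclasses import dataclass


@dataclass
class Measure:
    """Running statistics for one (index, signal, time bucket) combination."""
    count: int = 0
    total: float = 0.0
    mean: float = 0.0
    m2: float = 0.0    # running sum of squared deviations from the mean
    last: float = 0.0

    def update(self, value):
        # Welford's online update: no need to keep the individual values.
        self.count += 1
        self.total += value
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        self.last = value

    @property
    def std_dev(self):
        # Population standard deviation (an assumption for this sketch).
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


def process_request(store, index, signal, bucket, value):
    """Retrieve the Measure, create it if missing, then update it."""
    key = (index, signal, bucket)
    measure = store.get(key)          # step 1: try cache / storage
    if measure is None:               # step 2: create if not found
        measure = Measure()
        store[key] = measure
    measure.update(value)             # step 3: calculate and update
    return measure


if __name__ == "__main__":
    store = {}
    for price in (20.0, 22.0):
        m = process_request(store, "AcmeSoftware:FTSE", "Price", "2013-08-23T09:13", price)
    print(m.count, m.total, m.mean, m.std_dev, m.last)
```

Keying on the full (index, signal, bucket) combination is what lets separate requests be processed independently of one another.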
Graphically the whole process looks something like this:
The advantages of this approach are manifold. First, it allows for very sophisticated search capabilities across Measures and Models. Second, it allows deep parallelization for Measure calculation. This parallelization allows us to scale the system by creating more Indexing Services and Calculation Services with no risk of contention and it is this scalability which allows us to provide near real-time, streaming updates for all Measures and most Models. Each Index, time, and measure combination is unique and can be calculated by separate threads or even separate machines. A measure can be aggregated up from its native granularity using a pyramid scheme if the user requests it (say by querying for an annual number from a measure whose Signal has a native granularity of a minute). A proprietary algorithm prevents double counting for the edge cases where Measures with different Indexes are calculated from the same Events.
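To give a feel for aggregating a Measure up from its native granularity, here is a small sketch that rolls minute-level sums and counts up to hourly ones. It covers only the simplest case, under the assumption that Sum and Count combine by plain addition and Avg is derived from them. The names `roll_up` and `to_hour` are hypothetical, and the double-counting prevention mentioned above is not shown.

```python
from collections import defaultdict
from datetime import datetime


def to_hour(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def roll_up(fine, truncate):
    """Combine fine-grained (sum, count) pairs into coarser buckets.

    `fine` maps a bucket start time to a (sum, count) tuple;
    `truncate` maps a fine bucket time to its coarser bucket time.
    """
    coarse = defaultdict(lambda: [0.0, 0])
    for ts, (total, count) in fine.items():
        bucket = coarse[truncate(ts)]
        bucket[0] += total
        bucket[1] += count
    return {ts: (total, count) for ts, (total, count) in coarse.items()}


if __name__ == "__main__":
    minute_volume = {
        datetime(2013, 8, 23, 9, 13): (10000.0, 1),
        datetime(2013, 8, 23, 9, 14): (5000.0, 1),
        datetime(2013, 8, 23, 10, 0): (700.0, 2),
    }
    for hour, (total, count) in sorted(roll_up(minute_volume, to_hour).items()):
        print(hour, total, count, total / count)  # last column is the hourly Avg
```

The same function can be applied again with a day or year truncation, which is the pyramid idea: each level is built from the level below rather than from the raw Events.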
So now you’ve seen how we get from a raw stream to a Measure. And how, along the way, we’re able to enrich meta and numeric data to enable both richer search capabilities and easier computation of more complex analytics models. Later on, we will explore how searches are performed and models are developed; you will see how this enrichment process makes exploring and creating complex analytics models much easier than the first generation of big data, business intelligence, or desktop analytics systems.
However, before we get there we need to talk about how PatternBuilders handles dates and Granularity in more detail. At our core, we are optimized for time-series analytics and how we deal with time is a critical part of our infrastructure. This is why in my next post we will be doing a deep (ok medium deep) dive into how we handle pyramidal aggregation and the always slippery concepts of time and streaming data. Thanks for reading and as always comments are free and welcomed!
Enterprise Software in the Cloud: Why We Chose Azure as our First PaaS Platform
I’ve been absent from the blog too long, but if you’ve been following my colleagues’ (Mary and Marilyn) postings, you’ll see it’s been a very busy and fruitful time at PatternBuilders. While I’m still overdue for the next segment of the architecture blog series, I thought I would take a break and talk a bit about some of the things we learned as we moved our product and business model to Microsoft Azure.
As someone who has worked with Microsoft technology and partnered with them off and on over the last two decades (even flirting with going to work for them a couple of times), the most surprising discovery was how serious Microsoft has become about the cloud, open source, and being an active and supportive partner for startups. As many of you who have been around as long as I have will no doubt remember, this is a very different, some would say revolutionary, move for the world’s most powerful proprietary software company. We had some concerns when we became members of Microsoft’s Azure Startup program BizSpark Plus and subsequently the more exclusive BizSpark One, but it has turned out to be a great experience for us on both the business and technical level. (more…)
AnalyticsPBI for Azure: Turning Real-Time Signals into Real-Time Analytics
For the second post on AnalyticsPBI for Azure (first one here), I thought I would give you some insight on what is required for a modern real-time analytics application and talk about the architecture and process that is used to bring data into AnalyticsPBI and create analytics from them. Then we will do a series of posts on retrieving data. This is a fairly technical post so if your eyes start to glaze over, you have been warned.
In a world that is quickly moving towards the Internet of Things, the need for real-time analysis of high velocity and high volume data has never been more pronounced. Real-time analytics (aka streaming analytics) is all about performing analytic calculations on signals extracted from a data stream as they arrive: for example, a stock tick, RFID read, location ping, blood pressure measurement, clickstream data from a game, etc. The one guaranteed component of any signal is time (the time it was measured and/or the time it was delivered). So any real-time analytics package must make time and time aggregations first-class citizens in its architecture. This time-centric approach provides a huge number of opportunities for performance optimizations. It amazes me that people still try to build real-time analytics products without taking advantage of them.
Until AnalyticsPBI, real-time analytics were only available if you built a huge infrastructure yourself (for example, Wal-Mart) or purchased a very expensive solution from a hardware-centric vendor (whose primary focus was serving the needs of the financial services industry). The reason that the current poster children for big data (in terms of marketing spend at least), the Hadoop vendors, are “just” starting their first forays into adding support for streaming data (see Cloudera’s Impala, for example) is that calculating analytics in real-time is very difficult to do. Period.
Introducing AnalyticsPBI for Azure—A Cloud-Centric, Components-Based, Streaming Analytics Product
It has been a while since I’ve done posts that focus on our technology (and big data tech in general). We are now about 2 months out from the launch of the Azure version of our analytics application, AnalyticsPBI, so it is the perfect time to write some detailed posts about our new features. Consider this the first in the series.
But before I start exercising my inner geek, it probably makes sense to take a look at the development philosophy and history that forms the basis of our upcoming release. Historically, we delivered our products in one of two ways:
- As a framework which morphed (as of release 2.0) into AnalyticsPBI, our general analytics application designed for business users, quants, and analysts across industries.
- As vertical applications (customized on top of AnalyticsPBI) for specific industries (like FinancePBI and our original Retail Analytics application) which we sold directly to companies in those industries.
Privacy and Big Data: Speaking at Strata East (NYC), Book Update, and Upcoming O’Reilly Webcast
There are times when Terence and I look at each other and say, “What on earth were we thinking?” And this is one of those times! PatternBuilders is crazy busy right now putting out release 3.0 of our Analytics Platform (the secret sauce for our analytics applications that we like to call data-science-in-a-box), ramping up on a funding round, working with partners on a University of Sydney research project on the impact of social media on a company’s stock price (a really fun project and a post about it is in the works), and, of course, supporting customers and prospects on their big data initiatives. So… since we did not have enough to do (sarcasm on), we decided it was time to update our book, participate in a pre-Strata East webcast, speak at the Strata Conference and the MongoDB User Group (that is collocated with Strata) in New York City! In the words of the immortal Bette Davis in All About Eve (and ever so slightly revised):
“Fasten your seat belts, it’s going to be a bumpy ride!”
Really, what were we thinking????? (more…)
Weekly Roundup: Privacy, Security, Amazon Reviews, Infographic Resumes, and the Comma!
Folks, I am neck deep in writing “stuff” this week (from my final McKinsey health care post to working with Terence on another chapter for our upcoming Ebook—yep, shameless plugs strike again!) but so many great posts and articles came through my “inbox” this week that I just have to “talk” about them. If you have some time over this long weekend, every single one of these items is worth a thorough read.
Privacy is Every Where and No Where
One of the most thoughtful posts on privacy in the digital world, courtesy of John Jordan, came out today. John’s use of real world examples to illustrate his own angst on the topic made me stop and think:
“Does it matter that a person’s political alignment, sexual orientation, religious affiliation, and zip code (a reasonable proxy for household income) are now a matter of public, searchable record? Is her identity different now that so many facets of it are transparent? Or is it a matter of Mark Zuckerberg’s vision—people have one identity, and transparency is good for relationships—being implicitly shared more widely across the planet? Just today, a review of Google Plus argued that people don’t mind having one big list of “friends,” even as Facebook scored poorly on this year’s customer satisfaction index.” (more…)
McKinsey Study: Big Data & Analytics, Talent, and the “Brand”
This has been a very busy month for PatternBuilders! Our engineering team is busily testing our Social Media Analytics (SMA) solution, we are looking for companies (or folks) that have a strong social media presence (translation: lots of data feeds and lots of data to work with) that would like to beta it, speaking at pii2011 and MongoSF, and in our copious amounts of spare time, Terence and I are working on our Ebook on Privacy and Big Data (yep, still plugging!). For a sneak peek at SMA, visit our beta page and if you’re interested, sign up for the beta.
Now, on to McKinsey’s study. As you all know, in a previous post I mentioned that McKinsey had just released its study on “Big Data: The Next Frontier for Innovation, Competition, and Productivity.” Weighing in at a mere (!) 156 pages, it is a deep dive into the potential of big data and analytics across five key areas:
- Health care in the United States (potential annual value estimated at $300 billion).
- Public sector administration in Europe (potential annual value estimated at €250 billion—more than the GDP of Greece).
- Retail in the United States (operating margins estimated to increase by 60%—for an industry that operates on razor thin margins, this is huge).
- Global manufacturing (increased efficiencies in design and production, more targeted products, and more effective promotion and distribution).
- Collection and analysis of personal location data (estimated at more than $100 billion in service provider revenue and more than $700 billion in value to consumers and business users).
Whither Mono?
The rumors have been flying hot and heavy about the future of Mono, the portable version of Microsoft’s .NET platform, since Novell was purchased by Attachmate. While Mono was, and remains, open source, its development was driven primarily by a team at Novell led by the brilliant hacker Miguel de Icaza. The gist of the rumors seems to indicate that some/all of the Mono developers have been laid off and that further development of Mono by Novell/Attachmate is in jeopardy. If either is true, it is a shame.
Mono never got the support it deserved from Microsoft, who never seemed to appreciate its potential to increase Windows’ legitimacy and presence in the server room by increasing the number of folks using the .NET framework. This lack of support, combined with some concerns about certain patents that Microsoft has that could be problematic for OS vendors who bundled Mono, dramatically slowed its adoption rate. And although the Java programming language left “the window open,” falling behind Mono/.NET technically due to political battles about its direction, Mono was unable to gain ground. (more…)
Data Security and You: There’s Got to Be a Better Way
A non-rant rant on data security.
Have you ever had one of those days when you throw up your hands and simply say, “There’s got to be a better way!” Well, this is one of those days. Recently, Jenn Webb, in an O’Reilly Radar piece, asked the following:
“How much convenience are you willing to give up for security?”
Webb was talking about Google’s 2-step verification process (I remembered reading about this a couple of months ago) which essentially “jumps” the user through a number of “hoops” to ensure more secure access to Google applications. I ended my comment on the article with the following: “Google, could you have made this any more difficult for people operating in the real world to use?” And once I clicked Submit, I thought I was done. Nope. The more I thought about this, the more I felt a rant coming on. I mean, really, how hard is it for companies like Google (and many others) to come up with a user-friendly way to ensure secure access? They certainly have the money to do it and by all accounts, they definitely have the engineering talent to do it. So what’s the problem? (more…)