Exploratory Data Analysis of the C12.22 Protocol for Smart Metering

ANSI C12.22 is an ANSI/IEEE standard that describes the operation of end devices, such as smart meters, over any LAN/WAN network. Together, these smart meters and LAN/WAN networks compose what is known as the Advanced Metering Infrastructure (AMI), a vital component of the smart grid. Enabled by advances in several technologies, the smart grid promises several benefits: greater reliability, continued affordability, new advancements and efficiencies, and the ability to accommodate both renewable and traditional energy sources (Department of Energy).

One example of an AMI consumer benefit is the consumer-friendly “Prices-to-Devices” near-real-time energy market pricing. Because price signals can be relayed directly to the smart meters, changes in consumer pricing take effect in almost real time. This allows consumers to take advantage of lower night rates offered by utility providers who calculate their energy pricing using peak periods. Although the benefits of the new smart grid technology are numerous, the exposure and use of the universal Internet Protocol for communication carries with it several of the traditional IT vulnerabilities.

Unlike some legacy ICS protocols that send their data in the clear, ANSI C12.22 does include AES encryption. So a question that we ask is: what do the unencrypted sections of the C12.22 protocol tell us about the information going through the network? More specifically, can ANSI C12.22 be scrutinized to give would-be attackers a critical understanding of the encrypted AMI data? Our preliminary approach to answering these questions is to compose feature vectors from ANSI C12.22 traffic that will allow us to apply unsupervised machine learning.

To begin our analysis, our group has worked on analyzing traces of ANSI C12.22 using the common network protocol analyzer Wireshark. Study of this protocol has revealed several features that are not encrypted and can be used for analysis. The called-AP-title and the calling-AP-title, for example, can be used to identify and match a request message with its succeeding reply message. Other usable features include the Application Protocol Data Unit (APDU) size, the time between messages, and some informational bits called the EPSEM flags.

Using a combination of the features mentioned above, our group has derived a feature vector that includes the request APDU size, the reply APDU size, delta t (the time between request and reply messages), and a flow bit. The flow bit is a binary representation of the location where a message was initiated. To build these feature vectors, all relevant data was filtered and exported from Wireshark. Because each feature vector is constructed from a request and its corresponding reply, a Python script was developed to read through the Wireshark output and compose the correctly formatted feature vectors.
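The pairing step can be sketched roughly as follows. This is not our actual script, only a minimal illustration under stated assumptions: the column names (time, calling_ap_title, called_ap_title, apdu_size), the sample rows, and the idea of deriving the flow bit from a set of known head-end AP titles are all hypothetical, and a real Wireshark export will look different.

```python
import csv
import io

# Hypothetical export; column names and values are made up for illustration.
SAMPLE_EXPORT = """time,calling_ap_title,called_ap_title,apdu_size
0.000,1.2.3.1,1.2.3.9,64
0.120,1.2.3.9,1.2.3.1,48
0.500,1.2.3.9,1.2.3.1,80
0.910,1.2.3.1,1.2.3.9,32
"""


def build_feature_vectors(rows, head_end_titles):
    """Pair each request with the next message travelling the opposite way.

    Returns tuples of (request size, reply size, delta t, flow bit).
    Assumption: the flow bit is 1 when the request came from a head-end title.
    """
    pending = {}  # (calling, called) -> request row still waiting for a reply
    vectors = []
    for row in rows:
        calling, called = row["calling_ap_title"], row["called_ap_title"]
        request = pending.pop((called, calling), None)
        if request is None:
            # Nothing outstanding in the reverse direction: treat as a request.
            # A second unanswered request on the same pair replaces the first.
            pending[(calling, called)] = row
            continue
        delta_t = round(float(row["time"]) - float(request["time"]), 6)
        flow_bit = 1 if request["calling_ap_title"] in head_end_titles else 0
        vectors.append(
            (int(request["apdu_size"]), int(row["apdu_size"]), delta_t, flow_bit)
        )
    return vectors


rows = list(csv.DictReader(io.StringIO(SAMPLE_EXPORT)))
vectors = build_feature_vectors(rows, head_end_titles={"1.2.3.1"})
for vector in vectors:
    print(vector)
```

Matching on the swapped AP-title pair is the simplest rule that fits the description above; traffic with several outstanding requests between the same two titles would need a stricter key.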

Several methods have been proposed and are currently being pursued for applying unsupervised machine learning to these vectors. One approach is hierarchical clustering, a method of cluster analysis that builds nested clusters by merging them successively. Each feature vector is treated as a point in space representing a particular request/reply pair. These points are then merged over several iterations by repeatedly joining the closest ones. This particular method of clustering constructs an agglomerative, or “bottom-up”, hierarchy. It would allow us to study the naturally formed clusters and determine whether, at any level of the hierarchy, a cluster correlates with a particular type of message, thus giving us new awareness of the traffic.
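To make the merging idea concrete, here is a small sketch of merging the closest points successively, which is usually called agglomerative clustering. It uses single linkage (distance between the two nearest members) purely as an assumption; the linkage rule, the Euclidean distance, and the sample vectors are illustrative choices, not results from our traces. Real feature vectors would likely need scaling first, since APDU sizes and delta t live on very different ranges.

```python
import math


def single_linkage(points):
    """Merge the two closest clusters until one remains.

    Returns the merge history as (member indices, merge distance) pairs.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
        merges.append((sorted(clusters[a]), d))
    return merges


# Made-up (request size, reply size, delta t, flow bit) vectors.
points = [
    (64, 48, 0.12, 1),
    (66, 50, 0.10, 1),
    (200, 16, 0.40, 0),
    (204, 18, 0.45, 0),
]
merges = single_linkage(points)
for members, distance in merges:
    print(members, round(distance, 3))
```

Reading the merge history from the first entry to the last walks up the hierarchy, so each level can be inspected for clusters that line up with a message type.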

It is still too early to draw a conclusion from this study. Our research group continues to analyze the ANSI C12.22 traffic and apply different machine learning methods. With cyber attacks increasing worldwide, cyber-physical systems are now viable targets for anyone with the right skills and motive. Considering that the AMI network provides critical access points for disrupting electrical power to end users, any weakness in this system would be detrimental.

The quest for the best – Creating Hive Plots

Recently I acquired a large data set of Twitter users to analyze, originally used in a report by Yang, C., Harkreader, R., Zhang, J., Shin, S., & Gu, G. called  Analyzing spammers’ social networks for fun and profit: a case study of cyber criminal ecosystem on twitter [1]. In short, the authors analyzed the network patterns of malicious users within their small world networks, identifying the traits these users had.

I wondered whether I could use a visual approach to answer a similar question: How can the network structures that identify malicious users as they appear on Twitter be detected using a visual methodology? What can visualizing networks using methods alternative to the traditional node-link approach reveal about the underlying network structure? I began my search for an appropriate visualization at hive plots, which I post about here.

I unearthed a few tools mentioned on www.hiveplot.net, including Jhive, a nifty Java application for visualizing hive plots, and D3.js, a JavaScript library that can be used to draw hive plots. Jhive was my first choice for uncovering the patterns within the networks of spamming Twitter accounts.

First, I wrote a small program in Java to parse the source files and convert them into various data formats, including DOT and JSON. However, after loading the files into Jhive, I ran into a scalability issue. The application couldn’t handle the large amount of data I was passing to it. Only one small portion of the graph managed to render on the screen at a time. Perhaps the source code of Jhive is not yet fleshed out enough to handle such large files with ease. Nevertheless, Jhive is the most robust tool in terms of the number of options available when visualizing hive plots.

I moved on to a language I was more comfortable with, JavaScript (D3.js). Still, even after moving the data into an external JSON file, the browser couldn’t handle drawing more than a few thousand nodes and their links. As with Jhive, the responsiveness was pathetic. So I decided to cut my losses and dive into a completely new language, R. More on that later.

Sources: 1. Yang, C., Harkreader, R., Zhang, J., Shin, S., & Gu, G. (2012, April). Analyzing spammers’ social networks for fun and profit: a case study of cyber criminal ecosystem on twitter. In Proceedings of the 21st international conference on World Wide Web (pp. 71-80). ACM.

High Res Data Graphics and Big Data

The human eye can make distinctions at 0.1 mm, or 254 distinctions within an inch. In The Visual Display of Quantitative Information, Tufte discusses how well data graphics can utilize the eye’s capability through data density.

Data density, as defined by Tufte, is the number of data entries within a graphic divided by the area of the display, or the amount of data per unit of area within the graphic. Tufte asserts that graphics with high density allow for greater interaction, facilitating the user’s ability to personalize the data, while graphics with low density convey a sense of low credibility in the source. Furthermore, low-density displays provoke questions such as “What are they leaving out?” and “Has this data been cherry-picked?”
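As a quick worked example of that definition, the function below divides the number of entries by the display area. The function name and the numbers are my own and purely illustrative.

```python
def data_density(num_entries, width, height):
    """Data entries per unit of display area (units follow width and height)."""
    return num_entries / (width * height)


# Illustrative numbers: 1,000 data entries on a 5 x 4 inch graphic.
density = data_density(1000, width=5, height=4)
print(density, "entries per square inch")
```

Shrinking the same graphic to half its width and half its height quadruples the density without adding a single data point.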

Utilizing high density display methods becomes almost a necessity for representing Big Data. Information graphics representing these incredibly large data sets may easily become too dense. However, Tufte provides suggestions to properly represent the underlying data in a smaller amount of space:

  1. Summary/ data reduction – averaging, clustering, smoothing
  2. The Shrink Principle – “Graphics can be shrunk way down.” Decrease the size of data graphics massively. Many can be shrunk by half without any loss of legibility.
  3. Small multiples – Showing the change of a variable across different frames while holding the design consistent. Successful small multiples will be comparative, multivariate, and high density – showing important shifts in variables.
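The first suggestion, summary and data reduction, can be sketched as simple bin averaging. This is only one of many possible reductions, and the bin size and sample values here are assumptions chosen for illustration.

```python
def bin_means(values, bin_size):
    """Reduce a series by replacing each run of bin_size values with its mean."""
    reduced = []
    for start in range(0, len(values), bin_size):
        chunk = values[start:start + bin_size]
        reduced.append(sum(chunk) / len(chunk))
    return reduced


series = [1, 3, 5, 7, 9, 11, 4]
reduced = bin_means(series, bin_size=2)
print(reduced)
```

The last bin may hold fewer values than the others, so its mean rests on less data than the rest of the reduced series.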

Interactive Analysis: An Overview

Interactive visualizations provide a feedback loop not present in static visualizations, allowing faster alterations of visualizations to answer the questions we have. In the book Applied Security Visualization, Raffael Marty defines a paradigm for non-static visualizations. Building on Ben Shneiderman’s Information Seeking Mantra, Marty’s new mantra involves these steps:

  1. Overview first
  2. Change graph attributes
  3. Zoom and filter
  4. Details on demand

The overview of the graph provides a springboard for subsequent visualizations – it defines the overall state of affairs. The second and third steps, called dynamic query, allow data from the database to be perpetually updated and displayed. The stores of data are usually queried by SQL or similar languages. Marty outlines some of the interesting properties of dynamic queries:

  • They show data context, showing similar entries to the results.
  • They allow for dynamic exploration, supporting “what if” analysis.
  • They provide for interactive exploration with sliders and other tools.
  • They allow for attribute exploration, giving users the opportunity to analyze any data dimension.
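A rough sketch of the idea behind a dynamic query: every time a slider moves, the same filter is re-run with new ranges and the display is redrawn from the result. The record fields and the dynamic_query helper below are hypothetical, not taken from Marty's book or from any particular tool.

```python
# Hypothetical event records; the field names are made up for illustration.
EVENTS = [
    {"src": "10.0.0.1", "port": 22, "bytes": 1200},
    {"src": "10.0.0.2", "port": 80, "bytes": 300},
    {"src": "10.0.0.3", "port": 443, "bytes": 9800},
    {"src": "10.0.0.4", "port": 8080, "bytes": 150},
]


def dynamic_query(records, **ranges):
    """Keep records whose attributes fall inside every (low, high) range."""
    return [
        record
        for record in records
        if all(low <= record[attr] <= high for attr, (low, high) in ranges.items())
    ]


# One slider on port, then a "what if" with a second slider on bytes.
well_known = dynamic_query(EVENTS, port=(0, 1023))
large_well_known = dynamic_query(EVENTS, port=(0, 1023), bytes=(1000, 10000))
print([record["src"] for record in well_known])
print([record["src"] for record in large_well_known])
```

Because any attribute can be passed as a range, the same function covers attribute exploration: swapping port for bytes changes the dimension being examined.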

However, there are a few limits on dynamic queries. First, most tools don’t support dynamic queries, and the few that do are in the commercial space. Second, dynamic queries are limited by the amount of data available. For larger data sets, the scope of queries should be reduced.

Using linked views is also useful for interactive analysis. One data set can be represented with different types of properties on multiple screens. A selection in one graph should translate into the other graphs appropriately. Attribute exploration allows for exploration of a single data dimension. How is this dimension significant? What are the distributions relating to this pattern? Are there any outliers?

Interactive analysis plays a large role in the larger security visualization activity: historical analysis. Historical analysis involves analyzing collected data sets to find interesting patterns. Interactive analysis allows users to more effectively comb through data collected in the past.

Lessons from Tufte, Applied to Security Visualization

In The Visual Display of Quantitative Information, Edward Tufte defines these main ‘axioms’ (guidelines) for data graphics:

  1. Above all else show the data
  2. Maximize the data-ink ratio
  3. Erase non-data-ink
  4. Erase redundant data-ink
  5. Revise and edit

The most important item in this list is maximizing the data-ink ratio. Data-ink is the ‘non-erasable core’ of the graphic, the part of the display that actually communicates usable information to the viewer (the data itself). To maximize the data-ink ratio, all (or most) non-data-ink must be erased, leaving only the data-ink. The data, not any non-data components, should be clearly shown. As I’ve read through some data security visualization literature, I’ve noticed a trend:

Most visualizations aren’t maximizing the data-ink ratio.

Below I’ve identified some key visualizations that have high levels of non data-ink:

Top Challenges for Interaction in Extreme Scale Visual Analytics

In The Top 10 Challenges in Extreme-Scale Visual Analytics (2012), Wong et al. identify the main challenges for extreme-scale visual analytics (VA). A subset of them:

  1. In Situ Interactive Analysis – How can we streamline display and analysis of data on the fly?
  2. User-Driven Data Reduction – What kind of mechanism can we construct to allow users to control data reduction according to their needs?
  3. Scalability and Multilevel Hierarchy – What is the optimal resolution for the deep hierarchies found in large scale VA?
  4. Representing Evidence and Uncertainty – How can we prevent bias through visualization?
  5. Heterogeneous-Data Fusion – How can we effectively analyse heterogeneous data entities?
  6. Data Summarization and Triage for Interactive Query – How can we effectively manage interactive query and streamlined summarization of large scale data?
  7. Analytics of Temporally Evolved Features – How can we best visualize data changes over temporally long spans with minor fluctuation?
  8. The Human Bottleneck – How can we adapt and compensate for the unchanging cognitive ability of humans in light of the constantly growing complexity of high-performance computing (HPC) components?

One challenge, representing evidence and uncertainty, is particularly important. As noted in a previous post, some visualizations fail to scale up effectively, becoming muddled as more data is added to the canvas. Defining a uniform, scalable, and intuitive visual language is a critical next step in large-scale analytics. Without a common language, a set of criteria against which graphs can be measured, aesthetic choices can end up muddling the display, hindering uniform analysis based on the visualization alone.

Wong, P. C., Shen, H. W., Johnson, C. R., Chen, C., & Ross, R. B. (2012). The top 10 challenges in extreme-scale visual analytics. IEEE Computer Graphics and Applications, 32(4), 63-67.

Hive Plots vs “Hairballs”

One of the most consistent issues with Big Data visualization is the inability of many graph types to scale to large sizes. In particular, node-link graphs have a tendency to resemble “hairballs”, or densely populated complex graphs. It becomes difficult, if not impossible, to discern a consistent pattern as the pool of data grows larger and larger. How can we solve this problem?

One solution comes in the form of fish-eye magnification, in which the graph magnifies as a person hovers their cursor over specific areas. Another solution is highlighting nodes as the cursor hovers over them, or hiding unselected nodes completely.

Another, more intriguing visualization removes the need for the arbitrary node-link structure completely. This graph is called the hive plot, as coined by Martin Krzywinski. Each node is mapped onto one of three axes according to measures such as node connectedness and centrality. As opposed to traditional node-link diagrams, whose layout is arbitrary, hive plots arrange nodes according to a rigid set of rules.

Each node is mapped onto an axis, and each edge is represented as a curved line between nodes. Different mappings of nodes onto axes will change the focus of analysis. For example, axis size may change depending on which node patterns need to be revealed. The standard structure of the hive plot allows for a genuine analysis of previously hard-to-read complex node-link graphs, and for comparison among different generations of nodes mapped onto different hive plots.
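Here is a minimal sketch of what such a rule-based mapping might look like. The rules are my own assumptions for illustration, not a fixed part of the hive plot definition: nodes with no incoming edges go on a "source" axis, nodes with no outgoing edges on a "sink" axis, everything else on an "intermediate" axis, and the position along each axis is the node's total degree scaled to the largest degree in the graph.

```python
import math
from collections import Counter

# Three axes spaced 120 degrees apart; the names and angles are assumptions.
AXIS_ANGLES = {"source": 90.0, "intermediate": 210.0, "sink": 330.0}


def hive_layout(edges):
    """Map each node to (axis, radius, x, y) using fixed, repeatable rules."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dst for _, dst in edges)
    nodes = sorted(set(out_deg) | set(in_deg))
    max_deg = max(out_deg[n] + in_deg[n] for n in nodes)
    layout = {}
    for n in nodes:
        if in_deg[n] == 0:
            axis = "source"
        elif out_deg[n] == 0:
            axis = "sink"
        else:
            axis = "intermediate"
        # Position along the axis: total degree scaled into (0, 1].
        radius = (out_deg[n] + in_deg[n]) / max_deg
        theta = math.radians(AXIS_ANGLES[axis])
        x = round(radius * math.cos(theta), 3)
        y = round(radius * math.sin(theta), 3)
        layout[n] = (axis, radius, x, y)
    return layout


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
layout = hive_layout(edges)
for node, placement in layout.items():
    print(node, placement)
```

Because the placement depends only on the rules and the edge list, the same network always produces the same picture, which is what makes hive plots comparable with one another.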

Many more visualizations can be generated by D3.js, a JavaScript library for manipulating data-driven documents.