Investigating Insider Trading via Big Data
Adarsh Kulkarni, Priya Mani, Suraj Kulkarni
Thomas Jefferson High School for Science and Technology
George Mason University Data Mining Lab
Financial markets have always been notoriously difficult to understand, which makes white collar crimes such as illegal insider trading all the more challenging to detect. As a result, the detection of illegal insider trades has been a characteristically human-driven process. We aim to challenge that precedent by taking a big data approach to the problem and applying various data mining techniques. Through the US Securities and Exchange Commission’s (SEC) Electronic Data Gathering, Analysis and Retrieval system (EDGAR), insider trade filings have been put into the public domain, providing a substantial source of data for researchers.
Insider trading is one of the numerous white collar crimes that contribute to the instability of the economy. Financial crimes and the mishandling of billions of dollars are key issues in our society that have been blamed for some of the financial crises we’ve seen over the last decade. Detecting and curbing financial crimes like illegal insider trading is clearly in the interest of both the government and the public. The first step in that battle is to understand the patterns and investigate the behaviors of insiders, which is what we aim to do in this study.
Insider Trading, by formal definition, is not always illegal. Generally, insiders in a company tend to be either Officers (CEO, CFO), large shareholders (>10%) or members of the Board of Directors. Other roles, such as Vice Presidents, can be considered insiders as well but most fall into those three categories. For these people, a substantial percentage of their compensation comes from stock and option awards. When they decide to liquidate their holdings, they file with the SEC and dispose of their shares. This process becomes illegal when the insider leverages information that only he or she may possess in order to trade stock at an unfair profit. For example, if the CEO of a company knows the stock will rise after announcing that they’ve surpassed their quarterly goals, and decides to buy stock before that announcement, the CEO is misusing his or her information to gain an unfair advantage.
It is important to note that one result of the stock-based compensation that is common among the C-suite is that there are substantially more insider sales than insider purchases, and our data reflect that. Most insider trades are completely legal and are completed only because the insider intends to liquidate or diversify their holdings.
Currently, the SEC requires all insiders to file a Form 4 whenever they acquire or dispose of their company’s stock. These forms require the insider to declare how much they are trading, what price they are trading at, and how many shares they still hold after the trade. These are the filings that the SEC publishes through the EDGAR system, providing us with a substantial amount of data for mining and analysis.
There are numerous online resources and websites that follow the EDGAR RSS feed and mine it for data. We chose to use a company called Insider Monkey, which monitors EDGAR in real time and parses Form 4 filings as they are processed by the SEC and made public. From Insider Monkey, we scraped ~1.1M insider trades, as well as all the insider positions held by each person who made them. For example, for the Silicon Valley executive Marc Benioff, Insider Monkey provides us with his entire record of stock trades for Salesforce (CRM), as well as the fact that he is Chairman and CEO of Salesforce, a Large Shareholder at Fitbit (FIT) and a member of the Board of Directors at Cisco Systems (CSCO).
Insider Monkey provides data organized in this fashion for all publicly traded companies and their insiders, making it well suited to scraping. Using a combination of Selenium and Scrapy, we created a script that loads and then scrapes the information from each webpage. Two processes run during this scraping. The first goes through the entire archive of insider trades available on Insider Monkey; all insider purchases can be found on one endpoint of the website, and all insider sales on another. As the scraper ingests a trade record, it checks whether that particular insider has already been analyzed. If not, the scraper takes a detour to that insider’s webpage, scrapes their positions (the second process), and then moves on to the next trade.
Through this process, we’ve been able to compile insider trading data that include ~760k sale transactions and ~310k purchase transactions from ~70k insiders trading ~12.5k different companies. The scraping process took approximately 3 days to complete while running on George Mason University’s compute cluster, Argo. As mentioned before, the data consists of many more sale transactions than it does purchase transactions. If we again look at the insider Marc Benioff, we see that he sells Salesforce (CRM) stock every single day, but never purchases any, mainly because as founder and CEO, he already owns ~34,000,000 shares of CRM.
Taking a look at Figure 2, we find that the majority of insiders in the corporate world are only part of a couple of companies. In fact, ~97% of insiders are part of companies. It is also important to note that, through qualitative analysis of our data, we see that most insiders that hold positions in companies are investment firms or banks. While an investment firm cannot hold the position of CEO or a similar insider title, it may own >10% of the company, and the SEC also requires owners with such a large stake to disclose their trades.
Once the data were scraped, the question remained of how to organize them in a database so that they would be easily accessible to the analysis programs that we wrote. Even deciding how to input the data so that each insider would be correctly associated with all of his or her trades and positions was not trivial. The unique identifier we used for each insider was their link address on Insider Monkey: there may be multiple people named John Doe, but no two insiders can share the same link.
We used a MySQL database to organize and store all our data, including historical prices for all the stocks that our insiders traded. We obtained those prices directly from Google Finance, which has an extremely helpful feature where one can download a CSV of all historical prices. From there it was a matter of downloading every CSV and feeding it into the database. For every ticker on Nasdaq and NYSE, we have pricing data spanning as far back as the 1980s for older corporations.
In Figure 3, we see that most insiders do not have a huge number of trades. The majority of insiders had trades throughout their time in the insider role. Those insiders who have thousands of trades generally fall into two categories. They may be a bank or investment firm with holdings in hundreds of different companies, or they may be executives who are also founders of the company and therefore hold very large amounts of stock. Again, looking at the example of Marc Benioff, he has ~34,000,000 shares and systematically sells ~10,000 every day. He has thousands of trades under his name, but that is because he is trying to liquidate his ~$2,500,000,000 in Salesforce stock.
There have been previous attempts to bring more modern techniques to the detection of illegal insider trading. Goldberg et al. found that over 85% of insider trading cases are correlated with five types of news: product announcements, earnings announcements, regulatory approvals or denials, mergers and acquisitions, and research reports, which they collectively refer to as PERM-R events. The task of maintaining surveillance over trading activity is gargantuan: ~5.5 million trades were found to be of interest by the Insider Trading and Fraud Teams at NASD, a self-regulatory organization that maintains watch over multiple securities markets, the most notable being the Nasdaq Stock Exchange (Goldberg et al., 2003).
SONAR was designed by Goldberg et al. to automate as much as possible of the manual process that NASD goes through every day. Consolidating data from numerous sources such as Reuters, Bloomberg and the Dow Jones, it uses Natural Language Processing to analyze 8,000-10,000 articles and correlate them with EDGAR filings as well as the rest of the market activity. Based on all this data, SONAR tries to flag suspicious trades that it believes must be more thoroughly investigated by a human analyst. Our work differs from Goldberg et al. in that we take a network approach to the problem and analyze the connections between one insider’s trading behavior and another’s.
Another study, by Donoho, investigated the patterns created by insider trading in the options market. It focused on options because they are traded much less frequently than stocks, so an insider’s trade is far more easily lost in the noise of the stock market than in the options market. The study conducted multiple case studies and found that whenever there was a PERM-R announcement similar to those described by Goldberg et al., options trading volume would spike.
One of the case studies described the acquisition of Pharmacia by Pfizer. Pharmacia stock opened 20% higher than its last closing price after the acquisition was announced over the weekend. Until then, call volume had been steadily below 1,500 trades, but in the days leading up to the official announcement it rose above 8,000. Options bought for $0.55 before the announcement could have been sold 3 days later for $4.10, a ~650% increase. While Donoho concentrates on the options market, which is indeed ripe for analysis, we focus on the standard stock market, hoping to detect illegal insider trades there.
More recently, in 2014, Tamersoy et al. mined data similar to ours and explored the relationship between trading behavior and insiders’ roles, their companies’ sectors, and their relationships to other insiders. They also performed anomaly detection over a network they built from all the insiders in their dataset. They found that many insiders are part of cliques whose members all trade very similarly, and that insiders’ trades tend to be abnormally profitable. Our findings concur with theirs, but we also aim to take a different approach: we examine our data on a per-company basis. We believe that while there is value in understanding the connections between different companies, the majority of the benefit to be gained from a network-based analysis can be found when comparing all the insiders of a given company.
Observations and Analysis
As part of our analysis of similarity between the different insiders, we constructed graphs based on the trading behavior of the insiders in a given company. The nodes are the insiders, and the edges are defined by a similarity function. At first, we used a simple similarity function that took into account only the dates on which the insiders traded, and what proportion of those dates were common to the two insiders. In this function, we define i1 and i2 as the sets of trade dates for insider 1 and insider 2.
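The exact form of the similarity function is not reproduced in this text. As a minimal sketch, assuming that the "proportion of common dates" is measured with the Jaccard index (shared dates divided by all distinct dates), the computation over i1 and i2 could look like the following. The function name `date_similarity` and the sample dates are illustrative only.

```python
from datetime import date


def date_similarity(i1, i2):
    """Share of distinct trade dates common to both insiders.

    Assumption: the Jaccard index |i1 & i2| / |i1 | i2| stands in for the
    paper's similarity function, whose exact form is not given here.
    """
    i1, i2 = set(i1), set(i2)
    union = i1 | i2
    if not union:
        return 0.0  # two insiders with no trades share nothing
    return len(i1 & i2) / len(union)


# Hypothetical trade dates for two insiders of the same company.
a = {date(2015, 3, 2), date(2015, 3, 3), date(2015, 3, 9)}
b = {date(2015, 3, 2), date(2015, 3, 3), date(2015, 4, 1)}

print(date_similarity(a, b))  # 2 shared dates out of 4 distinct dates -> 0.5
```

Other normalizations (for example, dividing by the size of the smaller set) would be equally consistent with the description; the choice changes which pairs clear a given threshold.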
The first step in creating these graphs and the subsequent visualizations was generating adjacency matrices for all the insiders in each company. Using PyMySQL, a Python package for interacting with a MySQL database, we created a script that goes through all insiders in a company and compares them to each other, deciding whether an edge should be defined between them. Once all the insiders for a company have been examined, we output a text file that contains the adjacency matrix for the company. It is important to note that for each company we generated two adjacency matrices, one for the purchase behavior of insiders and one for their sale behavior. We separated purchases from sales because they are opposing actions, so the patterns and anomalies that we find are most likely not common to the two.
Using this adjacency matrix, we can create visualizations like Figure 2 for any company, and examine the nature of these networks. One notable feature of these networks is the prominence of cliques, or sets of nodes in a graph that are all completely connected to each other. We see this in Figure 4, where all the insiders are connected to each other. Figure 5, while not quite as connected as Figure 4, also exhibits this behavior, as there are smaller cliques inside the network. Another feature of these visualizations is that they reflect the extent to which the nodes are similar: the thickness of each edge is proportional to the magnitude of the similarity factor. It is also important to note that we applied a threshold to the similarity values: for there to be an edge at all, we need S >= 0.5.
Naturally, the next step from here is to expand our similarity function to account for more details of the insiders’ trading behavior. So, we moved from simply using common trade dates to considering the longest common subsequence of trade dates between any two insiders. The similarity function can be expanded further to take into account the volume of trading, the price, the total holdings of the insider, and how the trade compares to their overall share of the stock.
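The matrix construction and the S >= 0.5 threshold can be sketched as follows. This is an illustration under stated assumptions rather than the script used in the study: trade dates are passed in as an in-memory dictionary instead of being read through PyMySQL, the similarity is assumed to be the Jaccard index of the two date sets, and the names `jaccard`, `adjacency_matrix` and the sample insiders are hypothetical.

```python
def jaccard(a, b):
    """Assumed similarity: shared trade dates over all distinct trade dates."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def adjacency_matrix(trade_dates, threshold=0.5):
    """Build a weighted adjacency matrix for one company and one trade type.

    trade_dates maps an insider id to the set of dates on which that insider
    traded. An edge is kept only when the similarity S >= threshold, and its
    weight is S (the quantity drawn as edge thickness).
    """
    insiders = sorted(trade_dates)
    n = len(insiders)
    matrix = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1, n):
            s = jaccard(trade_dates[insiders[r]], trade_dates[insiders[c]])
            if s >= threshold:
                matrix[r][c] = matrix[c][r] = s  # undirected graph
    return insiders, matrix


# Hypothetical sale dates for three insiders of one company.
sales = {
    "insider_a": {"2015-03-02", "2015-03-03"},
    "insider_b": {"2015-03-02", "2015-03-03", "2015-03-09"},
    "insider_c": {"2015-06-01"},
}
insiders, matrix = adjacency_matrix(sales)

# One text row per insider, as in a plain-text adjacency matrix file.
for row in matrix:
    print(" ".join(f"{weight:.2f}" for weight in row))
```

Running the same function once on purchase dates and once on sale dates yields the two matrices per company described above.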
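The longest common subsequence (LCS) of two trade-date sequences can be computed with the standard dynamic programming recurrence. The sketch below assumes each insider's trades are listed in chronological order, and it normalizes the LCS length by the longer sequence to obtain a value between 0 and 1. That normalization, and the names `lcs_length` and `lcs_similarity`, are assumptions for illustration, not the study's definition.

```python
def lcs_length(seq1, seq2):
    """Length of the longest common subsequence, using O(len(seq2)) memory."""
    prev = [0] * (len(seq2) + 1)
    for x in seq1:
        curr = [0]
        for j, y in enumerate(seq2, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend the match
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one element
        prev = curr
    return prev[-1]


def lcs_similarity(seq1, seq2):
    """Assumed normalization: LCS length divided by the longer sequence."""
    longest = max(len(seq1), len(seq2))
    return lcs_length(seq1, seq2) / longest if longest else 0.0


# Hypothetical chronological trade dates for two insiders.
s1 = ["2015-03-02", "2015-03-03", "2015-03-09", "2015-03-10"]
s2 = ["2015-03-02", "2015-03-09", "2015-03-10", "2015-03-16"]

print(lcs_length(s1, s2))      # 3: 03-02, 03-09, 03-10
print(lcs_similarity(s1, s2))  # 0.75
```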
Another aspect of our analysis included the construction and examination of the connected components across the graphs of each company. This allows us to see just how extensive these networks can be.
To generate these connected components, we used a Python package called NetworkX, which provides a number of ready-made tools and algorithms for graphs. Again, this analysis was done with purchases and sales kept separate. As such, there was one set of connected components for each company’s insiders’ sales, and another for that company’s insiders’ purchases.
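As a rough sketch of this step, the snippet below loads a thresholded adjacency matrix into a NetworkX graph and lists its connected components with `nx.connected_components`. The helper name `components_from_matrix` and the sample matrix are hypothetical; the real matrices were read from the per-company text files.

```python
import networkx as nx


def components_from_matrix(insiders, matrix):
    """Return the connected components of the graph encoded by the matrix.

    A nonzero entry matrix[r][c] is treated as an edge whose weight is the
    similarity between insiders r and c.
    """
    graph = nx.Graph()
    graph.add_nodes_from(insiders)  # keep insiders that have no edges
    for r in range(len(insiders)):
        for c in range(r + 1, len(insiders)):
            if matrix[r][c] > 0:
                graph.add_edge(insiders[r], insiders[c], weight=matrix[r][c])
    # Sort for a stable, readable order.
    return sorted(sorted(comp) for comp in nx.connected_components(graph))


# Hypothetical sale-similarity matrix for five insiders of one company.
names = ["a", "b", "c", "d", "e"]
sale_matrix = [
    [0.0, 0.8, 0.0, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.0],
    [0.0, 0.0, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

components = components_from_matrix(names, sale_matrix)
print(components)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Calling the helper once with the sale matrix and once with the purchase matrix gives the two sets of components per company.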
Another analysis approach that we took was with ego-nets, where we would treat one insider as the central ego, and build a network around them. In Figures 8 and 9, we see an example of this ego-net, where the red node is the designated ego, and the other blue nodes are other insiders.
When we qualitatively examine the graphs of the more anomalous nodes, we find that they share a distinctive characteristic. As is the case in Figures 8 and 9, they are members of multiple, otherwise unconnected cliques or quasi-cliques.
We found that there is a power law relationship between the number of neighbors an ego has and the number of edges in its ego-net. This power law, shown in Figure 10, allows us to see how far from the norm an insider’s behavior is, and thereby flag them as anomalous nodes. We used this method to perform anomaly detection on the ego-nets and detect insiders that seem to be suspicious.
To do this, we created a system for scoring how much of an outlier a given insider is. First, we created a local outlier score that is based on the least squares fit which is calculated on the median values. We define the second component of our outlier score as follows,
resulting in the TotalOutlierScore(u) = Score(u) + LocalOutlierFactor(u).
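The scoring formula itself is not reproduced in this text, so the following is only a sketch of the general technique. It counts neighbors and ego-net edges for every insider, fits a power law by least squares in log-log space on the median edge count per neighbor count, and scores each insider by its distance from the fitted curve. The score formula used here (ratio of observed to expected edges, scaled by the log of their absolute difference) is an assumption borrowed from common ego-net anomaly detection practice, and the LocalOutlierFactor term is left out. All function names and the sample graph are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median


def ego_net_counts(adj):
    """Map each ego to (number of neighbours, number of edges in its ego-net)."""
    counts = {}
    for ego, nbrs in adj.items():
        members = nbrs | {ego}
        # Each edge inside the ego-net is seen from both endpoints.
        edges = sum(len(adj[v] & members) for v in members) // 2
        counts[ego] = (len(nbrs), edges)
    return counts


def fit_power_law(counts):
    """Least squares fit of log(E) on log(N) over median E per N: E ~ C * N**theta."""
    by_n = defaultdict(list)
    for n, e in counts.values():
        if n > 0 and e > 0:  # the logarithm needs positive values
            by_n[n].append(e)
    xs = [math.log(n) for n in by_n]
    ys = [math.log(median(es)) for es in by_n.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    theta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - theta * mx), theta


def fit_distance_score(n, e, c, theta):
    """Assumed score: 0 on the fitted curve, growing with distance from it."""
    expected = c * n ** theta
    return max(e, expected) / min(e, expected) * math.log(abs(e - expected) + 1)


# Hypothetical graph: a 4-clique (a, b, c, d) plus a short chain a-f-g and a leaf e.
adj = {
    "a": {"b", "c", "d", "e", "f"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"a", "b", "c"}, "e": {"a"}, "f": {"a", "g"}, "g": {"f"},
}
counts = ego_net_counts(adj)
c, theta = fit_power_law(counts)
scores = {u: fit_distance_score(n, e, c, theta) for u, (n, e) in counts.items()}
print(sorted(scores, key=scores.get, reverse=True))  # most anomalous first
```

Under the combination stated above, a local outlier factor computed on the same (neighbors, edges) points would be added to this score to obtain the total.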
Our third approach for analysis was to investigate the profitability of insider trades. We went through each insider and compared their purchase or sale price to the closing price of the stock and deduced whether they made a profit on their trade. We track this profit based on
where TP is the trade price, ST is the stock trade and DV is the dollar volume of trading that occurred for that particular stock on that trade day. We then compare the trade price to the closing price of the stock on that day: if the insider made a profit, we leave the R-value positive, and if the insider did not make a profit, we make it negative.
That is, if the insider sold stock and the closing price was lower than the trade price, he or she made a profit, and if the insider bought stock with the closing price being higher than the trade price, the insider made a profit. The presence of a consistent profit might be an indication of a suspicious behavior, which can then be cross-referenced against events that affect that market such as any of the PERM-R events described by Goldberg et al.
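The sign rule described above is simple enough to sketch directly. The R-value formula is not reproduced in this text, so the code assumes R = (TP × ST) / DV, with ST read as the number of shares traded, which expresses the trade's dollar value as a fraction of that day's dollar volume. That formula, the function name `signed_r_value`, and the sample numbers are assumptions for illustration.

```python
def signed_r_value(trade_type, trade_price, shares, dollar_volume, close_price):
    """Return R, positive if the trade was profitable by the close, else negative.

    Assumption: R = trade_price * shares / dollar_volume. The paper's exact
    formula is not given in this text.
    """
    r = trade_price * shares / dollar_volume
    if trade_type == "sale":
        profitable = close_price < trade_price  # sold above the close
    elif trade_type == "purchase":
        profitable = close_price > trade_price  # bought below the close
    else:
        raise ValueError(f"unknown trade type: {trade_type!r}")
    return r if profitable else -r


# Hypothetical trades on a stock with $25M of dollar volume that day.
print(signed_r_value("sale", 50.0, 10_000, 25_000_000, close_price=49.0))      # positive
print(signed_r_value("purchase", 50.0, 10_000, 25_000_000, close_price=49.0))  # negative
```

A trade executed exactly at the closing price counts as not profitable under this rule, which is one of several reasonable ways to treat the tie.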
There are still numerous opportunities to continue this research and create more robust anomaly detection that can detect illegal insider trading, and here we outline a couple of them. In our network analysis, we use a similarity function that initially considers commonality of trade dates, and then the length of the longest subsequence of common trade dates. This similarity function has much room for expansion, such that it can take into account volume of the trades, the overall holdings of the insiders, as well as the relative prices at which they are executed and how profitable those trades are in the short-term (e.g. did the insider make money in the first week after the trade).
Additionally, expanding our system to also take into account derivatives trading behavior, such as put and call options, could help improve our anomaly detection by adding another layer of analysis. As Donoho found, the volume of options trading has a strong correlation with potential opportunities for insider trading, and it is a field worth exploring in further depth and integrating into our study.
There have been some high-profile and devastating cases of illegal insider trading whose details are now known. Additionally, the SEC maintains a record of all illegal insider trading cases that it is currently investigating. One way to improve our system would be to mine these cases and train the system on these positive examples.
Overall, this work on the detection of illegal insider trading is part of a larger investigation of patterns in the economy. In the future, we hope to take our patterns and findings and input behaviors into an agent-based simulation, and try to understand the effects that crimes like illegal insider trading have on our society by running the simulation faster than real time.
We believe our work has explored an area of great need with novel techniques and paves the way for much more research to come. We took a distinctive approach with our network-based analysis, in which we constructed graphs with insiders as nodes and defined edges based on the similarity of their trading behavior. We found that cliques are fairly common in these graphs, and that certain companies and insiders in our data exhibit what we define as anomalous behavior. We do not take the matter any further, as we are still developing our system and hope to improve our anomaly detection.
Tamersoy, A., Khalil, E., Xie, B., Lenkey, S.L., Routledge, B.R., Chau, D.H., Navathe, S.B.: Large scale insider trading analysis: patterns and discoveries. Social Network Analysis and Mining (SNAM) 4(1), 1-17 (2014)
Bay, S., Kumaraswamy, K., Anderle, M.G., Kumar, R., Steier, D.M.: Large scale detection of irregularities in accounting data. In: Proceedings of the IEEE International Conference on Data Mining (2006)
Cicero, D., Wintoki, M.B.: Insider trading patterns (Working paper) (2013). Available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2128127
Donoho, S.: Early detection of insider trading in option markets. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2004)
Goldberg, H.G., Kirkland, J.D., Lee, D., Shyr, P., Thakker, D.: The NASD securities observation, new analysis and regulation system (SONAR). In: Proceedings of the Conference on Innovative Applications of Artificial Intelligence (2003)
McGlohon, M., Bay, S., Anderle, M.G., Steier, D.M., Faloutsos, C.: SNARE: A link analytic system for graph labeling and risk detection. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2009)