Whether you’re polling the opinion of your reader base, taking votes for an award or running a competition on your website, you probably have some concerns about your system being abused by the hate machine we call the internet. Someone may skew a poll to sway the public perception of popular opinion; some prize pig may spam an online competition to hog all the prizes. If you’re running any form of online submission system, the measures to prevent abuse need to increase proportionally with the incentive to break them. Why do I say proportionally? Because there’s a trade-off, and it’s a bit of a balancing act.
The trade-off is between the effectiveness of the security measure, the difficulty of implementation and the amount it inconveniences the user. You don’t want to waste all of your resources implementing an ineffective solution that does nothing but prevent legitimate users from participating (trust me, users are pretty lazy).
Here’s a rundown on all of the different methods you could use to reduce fraud.
Use a Script to Hide the Submission Form
Effectiveness:
Implementation difficulty:
User Inconvenience:
I’ll be upfront and explain why this is a very bad idea: JavaScript runs at the client – a machine that you have literally zero control over, and must disclose the code to in plain text. Any client can analyse and modify the code. Using JavaScript to hide the poll submission form and replace it with the running poll results may work at a functional level, but it offers nearly nothing for security.
Here’s a quick example of this malpractice, plucked line-for-line from News Limited’s content system:
if (!poll.hasVoted(pollid)) {
    $poll.append(inputElement);
    $poll.submit(function () {
        poll.send($poll, pollid, opts);
        return false;
    });
} else {
    poll.showResult($poll, pollid, opts);
}
Basically, you’ll see the poll submission form appear while the page is loading, and it will then get replaced by the poll results. If a user is fast enough, they can submit a vote while the page is still loading. Predictably, the poll.hasVoted() function evaluates a cookie to check if the user has already voted:
hasVoted: function (pollid) {
    var cookieValue = $.cookie('pollVotes');
    if (cookieValue !== null && cookieValue.match(pollid) !== null) {
        return true;
    } else {
        return false;
    }
},
This means that circumventing this measure in a browser is as trivial as clearing the cookies and refreshing the page. As this measure merely prevents the poll from being displayed in a browser, a bot will not be even slightly hindered by it because it does not need to evaluate the markup or draw the page.
Remember that not everyone browses with JavaScript enabled. There are those tinfoil hatters who don’t trust scripts. They’re few and far between, but you should be aware that relying on JavaScript does punish those running NoScript.
Require a Session Cookie with the Submission
Effectiveness:
Implementation difficulty:
User Inconvenience:
We saw that News Limited uses session cookies to determine if you’ve already voted, but the server does not require this cookie when submitting a vote. An example of a voting system that does require it is on the ABC’s website for “The Drum”.
The server checks if a vote has already been cast for the session, and silently rejects all subsequent votes.
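Here’s a minimal sketch of what that server-side check might look like. It is not the ABC’s code: the SessionPoll class, the castVote method and the in-memory Set are all my own assumptions, standing in for whatever session store a real site would use.

```javascript
// Hypothetical sketch: one vote per session, enforced on the server.
// The in-memory Set stands in for whatever session store you really use.
class SessionPoll {
  constructor() {
    this.votedSessions = new Set(); // session IDs that have already voted
    this.tally = new Map();         // option -> number of counted votes
  }

  // Returns the same response whether or not the vote was counted,
  // so the client cannot tell that a repeat vote was rejected.
  castVote(sessionId, option) {
    if (sessionId && !this.votedSessions.has(sessionId)) {
      this.votedSessions.add(sessionId);
      this.tally.set(option, (this.tally.get(option) || 0) + 1);
    }
    return { status: "ok" };
  }
}

const poll = new SessionPoll();
poll.castVote("session-a", "yes");
poll.castVote("session-a", "yes"); // same session: silently ignored
poll.castVote("session-b", "no");
console.log(Object.fromEntries(poll.tally)); // { yes: 1, no: 1 }
```

Note that a request with no session at all is ignored too, which is the part News Limited skipped.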
How it can be broken:
The problem is that it is extremely easy to dispose of a session and generate a new one. The usual trick of clearing your browser’s cookies will work. curl comes with a built-in cookie engine, so making a script perform the task of creating a session and using it to vote is as simple as pre-fetching the page and dumping the cookie to a file, then immediately voting with the cookie:
#!/bin/bash
while true; do
    # Fetch the poll page with a fresh session and save its cookies to a file.
    curl "http://www.abc.net.au/news/thedrum/polls/" -j -c "cookie"
    # Submit the vote, sending back the cookies saved above.
    curl --data "m_hhidRound=1439&m_hhidContentIndex=0&CSS=&r=http%3A%2F%2Fwww.abc.net.au%2Fnews%2Fthedrum%2Fpolls%2F&m_ucCandidateSelectionList%3Am_ucSelectionList%3Am_ctlSelectionList%3AVoteRadioGroupCandidates=61814" \
        "http://www2b.abc.net.au/votecentral/View/SubmitVote.aspx" \
        -b "cookie" -L
done
It does halve the throughput of a bot, but it’s still surprisingly fast.
Set a Submission Ticket from Within a Script
Effectiveness:
Implementation difficulty:
User Inconvenience:
This method can be effective when done properly. You could write some JavaScript that generates a ticket through some cleverly obfuscated code that gets validated by the server when submitting a vote. The only requirement at the client side is that they have JavaScript enabled.
The Age provided a pretty good example of how not to implement this:
<script type="text/javascript">
document.cookie = "checkIfCookiesEnabled=cookiesEnabled; path=/";
</script>
It’s a static string. Every poll gets submitted with the same value. All you need to do is copypasta that value into curl and you’re done:
--cookie "checkIfCookiesEnabled=cookiesEnabled; path=/"
What you actually need to do is make the cookie contain a different random value for each vote, and don’t accept the same value twice. Think of it like a serial key. You would have some JavaScript generate something at random, but it still needs to meet certain checks (such as a CRC). The server would then verify the code when a vote is submitted.
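As a rough sketch of the serial-key idea, assuming Node.js for both halves so it can run as-is: generateTicket, redeemTicket and the simple byte-sum check are hypothetical names and choices of mine, and the byte sum is swapped in for a real CRC purely to keep the example short.

```javascript
const crypto = require("crypto");

// Hypothetical check: the last byte must equal the sum of the random
// bytes modulo 256. A real scheme would use something less obvious.
function checksum(bytes) {
  let sum = 0;
  for (const b of bytes) sum = (sum + b) % 256;
  return sum;
}

// Client side (in the real page this would be the obfuscated part).
function generateTicket() {
  const body = crypto.randomBytes(8);
  const check = Buffer.from([checksum(body)]);
  return Buffer.concat([body, check]).toString("hex");
}

// Server side: verify the format and checksum, and refuse reuse.
const usedTickets = new Set();

function redeemTicket(ticket) {
  if (!/^[0-9a-f]{18}$/.test(ticket)) return false;
  const bytes = Buffer.from(ticket, "hex");
  const body = bytes.subarray(0, 8);
  if (checksum(body) !== bytes[8]) return false;
  if (usedTickets.has(ticket)) return false; // replayed ticket
  usedTickets.add(ticket);
  return true;
}

const ticket = generateTicket();
console.log(redeemTicket(ticket)); // true
console.log(redeemTicket(ticket)); // false
console.log(redeemTicket("checkIfCookiesEnabled")); // false
```

In a browser you would need a different source of randomness than Node’s crypto module, and the set of used tickets would need to live somewhere more durable than memory.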
How it can be broken:
The problem, of course, is that JavaScript code is open, and can easily be replicated inside a bot. However, if you obfuscate the algorithm by scattering the steps throughout the code (and running the code through a compressor like jsmin), a would-be hacker has a lot more work to do before they can replicate it inside their own bot.
Limit the Number of Votes Coming from a Single IP Address
Effectiveness:
Implementation difficulty:
User Inconvenience:
This is the counter-measure taken by both Fairfax and News Limited after their poll systems had been broken. The premise is very simple: An ISP typically gives a user a single IP address. It is very easy to determine the source IP address from the server end, and the source address of an established TCP connection cannot practically be spoofed. It’s fairly secure, and the end user won’t even notice that it’s being checked.
Well, sometimes they will.
Many large companies run their entire networks NATted behind a single IP address. In some companies, this can be hundreds of machines. If only one vote can be accepted from a single IP, then an entire corporation will get a single vote between them (or a limited number). It’s a balancing act. You could give each IP 10 votes, or one vote every few minutes. The end result is the same: you’re giving an entire corporation as much voting power as a single user, and you’re still relying on people being honest. Besides, there’s a way around it…
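For what it’s worth, the limit itself is only a few lines. This is a sketch under my own assumptions (an in-memory Map, a sliding time window, and the hypothetical IpRateLimiter name), not what Fairfax or News Limited actually run.

```javascript
// Hypothetical sketch: allow at most `maxVotes` per IP per time window.
class IpRateLimiter {
  constructor(maxVotes, windowMs) {
    this.maxVotes = maxVotes;
    this.windowMs = windowMs;
    this.history = new Map(); // ip -> timestamps (ms) of accepted votes
  }

  // `now` is passed in so the behaviour is easy to test.
  allow(ip, now = Date.now()) {
    const recent = (this.history.get(ip) || []).filter(
      (t) => now - t < this.windowMs
    );
    const allowed = recent.length < this.maxVotes;
    if (allowed) recent.push(now);
    this.history.set(ip, recent);
    return allowed;
  }
}

// Ten votes per IP per hour (both numbers are arbitrary assumptions).
const limiter = new IpRateLimiter(10, 60 * 60 * 1000);
let accepted = 0;
for (let i = 0; i < 25; i++) {
  if (limiter.allow("203.0.113.7", 1000 + i)) accepted++;
}
console.log(accepted); // 10
```

Setting maxVotes to 1 gives you the strict one-vote-per-IP variant, with all of the NAT problems described above.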
Using Tor, a single user can change their IP address on demand. This is done by forwarding requests through another person’s computer. Tor exit node operators agree to allow others to use their internet connection for almost anything they want.
Once Tor has been installed, running curl through a proxy and jumping between exit nodes every 10 votes or so is fairly straightforward:
#!/bin/bash
while true; do
    COUNTER=0
    # Cast 10 votes through the local HTTP proxy on port 8118.
    while [ $COUNTER -lt 10 ]; do
        curl --data "pollId=4016070&indexUrlPath=http%3A%2F%2Fwww.theage.com.au%2Fpolls%2Fopinion%2Ftarkine-wilderness-20130208-2e2ti.html%23poll&id=27263" \
            -x "http://localhost:8118" \
            --cookie "checkIfCookiesEnabled=cookiesEnabled; path=/" \
            http://feedback.theage.com.au/action/voteForAPoll
        let COUNTER+=1
    done
    # Ask the Tor control port for a new identity, and with it a new exit node.
    (echo authenticate '""'; echo signal newnym; echo quit) | nc localhost 9051
done
This was (and still is) enough to get around the security measures at both News Limited and Fairfax. You’d think that if someone could change the only identifying characteristic of their machine every few seconds, you’re screwed, right? Not really.
How many Tor exit nodes do you think there are? At the time of writing, there are fewer than 900 worldwide, and only 12 operating in Australia (myself being one of them). The list of nodes is updated every few minutes. If you were to block this small number of sources, you’d effectively be blocking this attack vector while only rejecting legitimate traffic from a small handful of potential users. The only site I know of that implements this is imgur, and I’m pretty sure it’s to stop people from posting CP.
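Checking a vote against the exit list could be as simple as a set lookup. In this sketch the list format (one address per line, with # for comments) and the function names are assumptions of mine; adapt the parser to whatever list you actually fetch, and refresh it regularly since the nodes change every few minutes.

```javascript
// Hypothetical sketch: reject votes whose source IP is a known Tor exit.
// The list format (one address per line, "#" for comments) is an
// assumption; adapt the parser to whatever list you actually fetch.
function parseExitList(text) {
  const exits = new Set();
  for (const line of text.split(/\r?\n/)) {
    const entry = line.trim();
    if (entry && !entry.startsWith("#")) exits.add(entry);
  }
  return exits;
}

function isTorExit(exits, ip) {
  return exits.has(ip);
}

// Documentation-range addresses used purely as sample data.
const sampleList = "# sample exit list\n192.0.2.10\n198.51.100.23\n";
const exits = parseExitList(sampleList);
console.log(isTorExit(exits, "192.0.2.10"));  // true
console.log(isTorExit(exits, "203.0.113.5")); // false
```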
Require Email Verification
Effectiveness:
Implementation difficulty:
User Inconvenience:
After submitting the form, an email will be sent to the user, containing a verification hyperlink to validate the vote.
Most users will see this as a sneaky way to harvest their email address, and the rest of them are lazy. All in all, expect your participation to decline by orders of magnitude. Sure, it initially seemed like it was worth the effort to vote for Taylor Swift to perform for a school of deaf children, but I was getting daily bombardments from her fan club for weeks after, until I finally got annoyed enough to unsubscribe.
Here’s the worst part: It won’t stop a ballot-stuffer. Like most IT pros, I own several domains, have a mail server and know how to set up a “catch-all” account. Using a catch-all, I don’t need to create a new email account for each vote. I can just increment and repeat. I could submit a vote for a@ubermotive.com, b@ubermotive.com, c@ubermotive.com and so on. All of the verification emails will land in a single account at my end, and a script could then parse the response emails and follow the email verification links. It’s a lot of effort, but it’s not impossible.
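One way you might at least notice a catch-all at work is to count verified votes per email domain. This is a sketch of my own rather than anything the sites above do; the function names and the threshold are hypothetical, and big public mail providers will legitimately trip it, so treat the output as a list to review, not a list to delete.

```javascript
// Hypothetical sketch: count verified votes per email domain so that a
// catch-all domain submitting many addresses stands out.
function votesPerDomain(emails) {
  const counts = new Map();
  for (const email of emails) {
    const domain = email.slice(email.lastIndexOf("@") + 1).toLowerCase();
    counts.set(domain, (counts.get(domain) || 0) + 1);
  }
  return counts;
}

// Domains over the threshold are flagged for a human to review.
// The threshold is an arbitrary assumption; large public mail providers
// will legitimately exceed it, so review rather than auto-reject.
function suspiciousDomains(emails, threshold) {
  return [...votesPerDomain(emails)]
    .filter(([, count]) => count > threshold)
    .map(([domain]) => domain);
}

const emails = ["a@example.org", "b@example.org", "c@Example.org", "d@example.net"];
console.log(suspiciousDomains(emails, 2)); // [ 'example.org' ]
```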
Require Registration on Your Site
Effectiveness:
Implementation difficulty:
User Inconvenience:
If you only let each account submit one vote, the only way to get multiple votes in is through creating multiple accounts. This effectively moves the security focus from the voting stage to the account creation process. On one hand, whatever annoyances you present to the user to ensure they are a unique human, you will only need to present once. On the other hand, many users won’t even bother to sign up and your participation rates will suffer. You may want to have a fallback voting method for users that are not logged in.
Reddit is a good example of a site that requires a login to (up)vote. I would imagine that its participation rates would be higher if the logins weren’t required. Looking at posts on the front page, you’ll see that imgur links with 250,000 views only have 5,000 votes up or down. This indicates that only around 2% of viewers actually participate in voting.
Challenge the User’s Cognitive Skills
Effectiveness:
Implementation difficulty:
User Inconvenience:
I’m talking about CAPTCHA. You know, those wavy, washy, warped bits of text that bring on nausea just by trying to decipher them? It’s a given that you won’t be able to solve them with an automated process. Ways have been discovered in the past, and they have been quickly patched. Since the same system is used by almost every website in existence, the odds of your site being targeted during the brief time the system is considered vulnerable are so close to zero that you needn’t worry about it.
One could circumvent these by using a relay attack. This requires paying around a buck for every 1000 puzzles solved by a human. You know you’re going to extreme measures when you’re paying a handful of people in third-world countries to carry out your bidding. You can likely prevent such attacks by monitoring the source IP as well.
Audit Your Data
Effectiveness:
Implementation difficulty:
User Inconvenience:
This is by far the best measure you can take to stop you from making a fool of yourself. No matter how well you think you’ve secured your system, nothing will fool a set of trained eyes.
While all other techniques are about prevention, this one is about diagnosis and cure. If you log enough information (which is not really that much), you can publish your results with reasonable confidence that they are not tainted.
By day, I write software for logging, graphing, validating and correcting data. Every point of data that my company publishes gets displayed graphically, checked over by a human, and signed off by an authorised NATA signatory before it gets published. Every fragment of data is logged. Every change to the data leaves an audit trail. You’d be amazed how easy it is to spot invalid data once it’s drawn in front of you. For user-submitted data, I’d recommend logging the following:
- Date and time the data was submitted
- IP address it was submitted from
- The browser’s user agent string
- All submitted fields
For a simple vote, there’s only one field submitted (the vote option). For a competition, this would also include the entrant’s details. Obviously, you’d be logging all of these details for a competition entry anyway.
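A log entry covering the fields listed above might look something like this. The field names, the one-JSON-object-per-line format and the sample values are my own assumptions; use whatever your logging setup already prefers.

```javascript
// Hypothetical sketch of an audit log entry holding the fields listed above.
function makeLogEntry(ip, userAgent, fields, now = new Date()) {
  return {
    submittedAt: now.toISOString(), // date and time of the submission (UTC)
    ip,                             // source IP address
    userAgent,                      // browser's user agent string
    fields,                         // every submitted field, unmodified
  };
}

// One JSON object per line keeps the log easy to append to and to parse.
function toLogLine(entry) {
  return JSON.stringify(entry);
}

const entry = makeLogEntry(
  "203.0.113.7",
  "curl/7.29.0",
  { pollId: "4016070", option: "27263" },
  new Date(Date.UTC(2013, 1, 8, 3, 0, 0))
);
console.log(toLogLine(entry));
```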
Look for patterns, and make sure they are what you’d expect to see. A voting bot would show a sudden spike in traffic, followed by a consistent load that doesn’t slow down at 3 AM like you’d expect. If you graph the number of votes per hour for each option, a sudden spike for a single option would become blatantly obvious. You can group votes by IP address and check for what appears to be too many for a single source. If you’re suspicious, make sure that the option distribution is in-line with submissions from other IP addresses.
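If you’d like a machine to do the first pass before a human looks at the graph, here’s a rough sketch of the votes-per-hour check. The entry shape (an ISO timestamp in submittedAt plus an option), the function names and the “five times the median” rule are all assumptions of mine; hours with no votes at all are simply not counted.

```javascript
// Hypothetical sketch: tally votes per hour per option, then flag hours
// where one option's count far exceeds its median hourly count.
function votesPerHour(entries) {
  const buckets = new Map(); // option -> Map(hour -> count)
  for (const { submittedAt, option } of entries) {
    const hour = submittedAt.slice(0, 13); // "YYYY-MM-DDTHH" of an ISO timestamp
    if (!buckets.has(option)) buckets.set(option, new Map());
    const perHour = buckets.get(option);
    perHour.set(hour, (perHour.get(hour) || 0) + 1);
  }
  return buckets;
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function findSpikes(entries, factor) {
  const spikes = [];
  for (const [option, perHour] of votesPerHour(entries)) {
    const typical = median(perHour.values());
    for (const [hour, count] of perHour) {
      if (count > typical * factor) spikes.push({ option, hour, count });
    }
  }
  return spikes;
}

// Made-up sample data: steady voting, then a bot switching on at 3 AM.
const sample = [];
const add = (hour, option, n) => {
  for (let i = 0; i < n; i++) {
    sample.push({ submittedAt: `2013-02-08T0${hour}:00:00.000Z`, option });
  }
};
for (let h = 0; h < 4; h++) add(h, "B", 3);
for (let h = 0; h < 3; h++) add(h, "A", 2);
add(3, "A", 50);
console.log(findSpikes(sample, 5)); // one spike: option A, hour 2013-02-08T03, count 50
```

The same grouping trick works for votes per IP address: swap the hour for the IP and look for sources with far more votes than the rest.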
During my poll-rigging escapade at News Limited, the one and only time I was genuinely impressed was during the tally for the AACTA People’s Choice Award, which was audited by Ernst & Young. They had a duplicate detection system in place, however it was seriously flawed because the API fired back a message in real-time to indicate whether the submission was detected as a duplicate or not. After populating the fields with randomly chosen dictionary words, I had submitted what would have to have been millions of “unique” votes for a scene from Underground: The Julian Assange Story.
So there I was, on awards night sitting on my sofa with beer and popcorn, when this came up:
For once, someone had actually taken the effort to check their data before blindly drawing conclusions from it. Instead of the award getting presented to a scene where Julian Assange phreaked a telephone exchange to play a perfectly-timed prank on a police sergeant, the award rightly went to a scene depicting an Aboriginal person getting socked in the face. This country sure knows a cinematic masterpiece when it sees it. But, I digress…
The auditors at Ernst & Young did their job, and did it well. They went to some effort to prevent duplicate submissions automatically, but most importantly, they did not rely on it.
This brings us to the last topic…
Programmatically Detect Duplicates
Effectiveness:
Implementation difficulty:
User Inconvenience:
I’ve had plenty of experience in writing duplicate detection algorithms. I once had a gig of fixing tens of thousands of duplicate accounts in a $30 billion fund. It was a process that autonomously moved tens of millions of dollars between accounts, and refunded over $1 million in duplicate account keeping fees. It had to scan a member base of nearly 1 million members, and it did it all in a coffee break. That said, the project cost well into six figures of consulting fees alone and took over a year to complete. It’s quite a resource-intensive feat.
For an online poll, the stakes are probably much lower. However, it’s also not that hard to detect a bot submitting the same thing over and over again. If you get it right, a legitimate user would not be inconvenienced in the slightest.
However, there’s one mistake that news.com.au made in the AACTA awards that you should not make. Do not immediately inform the client that their vote has been discarded. Instead, log the vote, tell the client that it was accepted, but silently mark it as a duplicate. This will have them think that what they are doing is working, and they will not attempt to circumvent it. If someone hooks their bot up to a random data generator, or worse, an electoral roll, you’re going to have a harder time removing the duplicates later. It’s best to make them think they don’t have to. Then, when you’re conducting your final tally, you can take a quick glance over the detected duplicates, and expunge them. Remember, your algorithm probably isn’t infallible – always inspect what you are deleting for false positives.
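Here’s a minimal sketch of that “accept everything, flag quietly” approach. The VoteLog class and its deliberately crude duplicate key (IP address plus user agent) are assumptions of mine; a real detector would compare far more than that.

```javascript
// Hypothetical sketch: store every vote, flag repeats, always answer "accepted".
class VoteLog {
  constructor() {
    this.votes = [];
    this.seen = new Set();
  }

  // The duplicate key (IP + user agent) is a deliberately crude assumption.
  submit({ ip, userAgent, option }) {
    const key = `${ip}|${userAgent}`;
    const duplicate = this.seen.has(key);
    this.seen.add(key);
    this.votes.push({ ip, userAgent, option, duplicate });
    return { accepted: true }; // identical response either way
  }

  // Final tally counts only votes not flagged as duplicates.
  tally() {
    const counts = {};
    for (const v of this.votes) {
      if (!v.duplicate) counts[v.option] = (counts[v.option] || 0) + 1;
    }
    return counts;
  }

  // Flagged votes are kept so a human can review them for false positives.
  flagged() {
    return this.votes.filter((v) => v.duplicate);
  }
}

const log = new VoteLog();
const bot = { ip: "203.0.113.7", userAgent: "curl/7.29.0", option: "A" };
for (let i = 0; i < 1000; i++) log.submit(bot);
log.submit({ ip: "198.51.100.4", userAgent: "Mozilla/5.0", option: "B" });
console.log(log.tally());          // { A: 1, B: 1 }
console.log(log.flagged().length); // 999
```

Nothing is deleted at submission time, so the flagged votes are still there to eyeball before you expunge them.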
Use a Combined Approach
If you’ve got the time to invest, I’d recommend implementing a combination of the above methods. If you focus on the methods which do not inconvenience the user, you can still achieve the same participation rates that you had before (unless your recorded rates are completely bogus, of course).