Preface: my goal is not to solve captcha using automation tools, but to attempt to understand why a browser that is being launched by selenium is being identified as a bot in the first place, and how selenium contributes to this.
I use selenium to start up firefox and log onto a website to scrape some data a few times a day.
I started up a regular instance of firefox (that is, without selenium), went to the website, clicked the checkbox, and it determined that I was a human and let me go.
I then became curious about the difference between launching firefox through the executable and launching it through selenium, so I launched firefox using this piece of java code:
WebDriver driver = new FirefoxDriver(new FirefoxProfile());
So I’m doing nothing more than starting firefox through selenium. Of course a lot of stuff is going on under the hood, but perhaps the selenium instance of firefox is not “human” enough?
So I tried a few different things to try to look more human:
- Maybe I just need to browse. Like a human.
There are many theories that talk about things like mouse movement, keyboard strokes, etc. So the browser starts up, I type in the URL, I click a few other links, I come back to the login page, type in username + password, then proceed to click on the captcha box…and I’m a bot.
- Maybe I don’t have any cookies or browsing history?
Selenium by default creates a new profile, so it has no cookies or browsing history. I can specify a custom profile to use, so I simply passed in my own firefox profile stored in APPDATA/roaming/mozilla/profiles. I verified that all of the websites for which I have saved credentials were there in the selenium-launched browser, but when I confronted the reCAPTCHA, it determined I was a bot and asked for image selection.
- Maybe I need to use caching?
By default, selenium uses a custom cache path that is cleaned up after the session is over. In firefox you can see this by going to about:cache, where it will say something like anonymous6337741624277931373webdriver-profilecache2, and there isn’t much there.
So I decided to use my own profile’s cache, and verified that all of my cached resources were there. But it didn’t make a difference.
- Maybe I just need to solve the captcha once?
Now I’m thinking, OK, so if google thinks I’m a bot, how about I solve the captcha in the selenium-launched browser once, let them know I’m good, and then it won’t happen again? Maybe it identifies the browser as a new client, and just needs to know that this new client is not a bot.
So I solved the captcha and successfully logged in. Then I logged out, returned to the login page, entered my credentials, pressed the reCAPTCHA box…and it asked me to solve the image selection problem again!
At this point I’m thinking, I just solved the captcha successfully half a minute ago, exhibited a bunch of manual human actions, but I’m still being identified as a bot.
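For reference, the custom-profile attempt above comes down to finding the profile directory on disk and handing it to selenium. Here is a minimal sketch of the lookup step using only the standard library. `ProfileLocator` and `listProfiles` are names I made up for illustration, and the root folder passed in is an assumption: it is whatever folder your firefox install actually keeps its profiles in.

```java
import java.io.File;
import java.util.Arrays;

// Hypothetical helper (not part of selenium): finds candidate profile folders.
class ProfileLocator {

    /**
     * Returns the sub-directories of profilesRoot, sorted by path.
     * Plain files (for example an .ini file) are skipped.
     * Returns an empty array if profilesRoot does not exist or is not a directory.
     */
    static File[] listProfiles(File profilesRoot) {
        File[] dirs = profilesRoot.listFiles(File::isDirectory);
        if (dirs == null) {
            return new File[0];
        }
        Arrays.sort(dirs);
        return dirs;
    }
}
```

Each returned directory can then be passed to selenium’s `FirefoxProfile(File)` constructor in place of the empty `new FirefoxProfile()`.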
Is there something specific about selenium that’s making google identify me as a bot automatically?
I would conclude that there is something on selenium’s end that is causing me to be identified as a bot, even when I’m using the browser the way a regular user would.
Perhaps there are specific JS objects that are injected into the DOM that google picks up on?
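One candidate I know exists: the W3C WebDriver spec defines a `navigator.webdriver` flag that a conforming browser sets while it is under automation. Whether reCAPTCHA actually reads it is only my guess. Below is a rough sketch for probing that flag, kept free of selenium imports so it can be tried with a stub. `AutomationProbe` and its `run` method are hypothetical names, and the script runner is an assumption: it stands for whatever executes JavaScript in the page and returns the result.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: evaluates a list of JavaScript expressions in the page
// and records which ones came back as true.
class AutomationProbe {

    // Expressions to check. navigator.webdriver comes from the W3C WebDriver spec;
    // anything else added here would be guesswork.
    static final String[] PROBES = {
        "navigator.webdriver === true"
    };

    /**
     * runScript takes a JavaScript snippet and returns its result
     * (a Boolean for these probes, or null if the script yields nothing).
     */
    static Map<String, Boolean> run(Function<String, Object> runScript) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (String expr : PROBES) {
            Object value = runScript.apply("return (" + expr + ");");
            // Anything other than Boolean.TRUE (including null) counts as "not detected".
            results.put(expr, Boolean.TRUE.equals(value));
        }
        return results;
    }
}
```

With a real driver the runner would be something like `js -> ((JavascriptExecutor) driver).executeScript(js)`. Running it in both the selenium-launched and the manually launched browser would at least show whether this particular flag differs between the two.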