Preface: my goal is not to solve captchas using automation tools, but to understand why a browser launched by Selenium is identified as a bot in the first place, and how Selenium contributes to this.

I use Selenium to start up Firefox and log in to a website to scrape some data a few times a day.

Recently the website changed its login system by adding Google’s reCAPTCHA, and every time I click the checkbox, Google determines that I am a bot and asks me to select a bunch of images.

I started up a regular instance of Firefox (that is, without Selenium), went to the website, clicked the checkbox, and it determined that I was a human and let me through.

I then became curious about the difference between launching Firefox through the executable and launching it through Selenium. I decided to launch Firefox using this piece of Java code:

WebDriver driver = new FirefoxDriver(new FirefoxProfile());

So I’m doing nothing more than starting Firefox through Selenium. Of course, a lot of stuff is going on under the hood, but perhaps the Selenium instance of Firefox is not “human” enough?

So I tried a few different things to look more human:

  1. Maybe I just need to browse. Like a human.

There are many theories that talk about things like mouse movement, keystrokes, etc. So the browser starts up, I type in the URL, I click a few other links, I come back to the login page, type in username + password, then proceed to click on the captcha box… and I’m a bot.

  2. Maybe I don’t have any cookies or history?

Selenium by default creates a new profile, so it has no cookies or browsing history. I can specify a custom profile to use, so I simply passed in my own Firefox profile stored in APPDATA/roaming/mozilla/profiles. I verified that all of the websites for which I have saved credentials were there in the Selenium-launched browser, but when I confronted the reCAPTCHA, it determined I was a bot and asked for image selection.
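For anyone reproducing this step: Firefox records each profile's directory in a profiles.ini file, so the folder to hand to Selenium can be looked up by profile name instead of being hard-coded. What follows is only a rough sketch in plain Java. The class and method names (ProfilesIniSketch, parseProfiles, pathForName) are made up for illustration, and it assumes the simple [ProfileN] / Name= / Path= layout, ignoring anything else the file may contain.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: pull profile entries out of the text of profiles.ini. */
class ProfilesIniSketch {

    /** Returns one key/value map per [ProfileN] section, in file order. */
    static List<Map<String, String>> parseProfiles(String iniText) {
        List<Map<String, String>> profiles = new ArrayList<>();
        Map<String, String> current = null;
        for (String raw : iniText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith(";") || line.startsWith("#")) {
                continue; // blank line or comment
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                // Assumption: only sections named Profile0, Profile1, ... describe profiles.
                String section = line.substring(1, line.length() - 1);
                if (section.startsWith("Profile")) {
                    current = new LinkedHashMap<>();
                    profiles.add(current);
                } else {
                    current = null; // e.g. [General]; its keys are skipped
                }
                continue;
            }
            int eq = line.indexOf('=');
            if (current != null && eq > 0) {
                current.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return profiles;
    }

    /** Returns the Path value of the profile with the given Name, or null if absent. */
    static String pathForName(String iniText, String name) {
        for (Map<String, String> profile : parseProfiles(iniText)) {
            if (name.equals(profile.get("Name"))) {
                return profile.get("Path");
            }
        }
        return null;
    }
}
```

The returned Path is usually relative to the folder that contains profiles.ini (IsRelative=1). Once resolved to a directory, it can be passed to `new FirefoxProfile(File)`. Selenium's Java bindings also include a ProfilesIni class that serves a similar purpose.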

  3. Maybe I need to use caching?

By default, Selenium uses a custom cache path that is cleaned up after the session is over. In Firefox you can see this by going to about:cache, where it will say something like anonymous6337741624277931373webdriver-profile\cache2, and there isn’t much there.

So I decided to use my own profile’s cache:

profile.setPreference("browser.cache.disk.parent_directory", PATH_TO_MY_PROFILE_CACHE);

I verified that all of my cached resources were there, but it didn’t make a difference.

  4. Maybe I just need to solve the captcha once?

Now I’m thinking, OK, so if Google thinks I’m a bot, how about I solve the captcha in the Selenium-launched browser once, let them know I’m good, and then it won’t happen again? Maybe it identifies the browser as a new client, and just needs to know that this new client is not a bot.

So I solved the captcha and successfully logged in. Then I logged out, returned to the login page, entered my credentials, pressed the reCAPTCHA box… and it asked me to solve the image selection problem again!

At this point I’m thinking, I just solved the captcha successfully half a minute ago, exhibited a bunch of manual human actions, but I’m still being identified as a bot.

Is there something specific about Selenium that’s making Google identify me as a bot automatically?

I have used a custom profile with a custom cache path. I use cookies. All the regular extensions installed on my profile are present. My user agent is unspoofed and no different from my normal browsing experience. There’s nothing in the request headers that would suggest it is any different from a regular browser.
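The header comparison can be made repeatable rather than eyeballed. The sketch below assumes the request headers from each browser have already been captured by some other means (for example, copied out of the network panel) into plain maps. HeaderDiffSketch and its methods are hypothetical names, not part of Selenium.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper: report headers that differ between two captured requests. */
class HeaderDiffSketch {

    /** Maps each differing header name (lower-cased) to "manualValue -> automatedValue". */
    static Map<String, String> diff(Map<String, String> manual, Map<String, String> automated) {
        Map<String, String> a = normalize(manual);
        Map<String, String> b = normalize(automated);
        Set<String> names = new TreeSet<>(a.keySet());
        names.addAll(b.keySet());

        Map<String, String> differences = new TreeMap<>();
        for (String name : names) {
            String left = a.get(name);
            String right = b.get(name);
            if (!Objects.equals(left, right)) {
                differences.put(name, show(left) + " -> " + show(right));
            }
        }
        return differences;
    }

    /** Header names are case-insensitive, so compare them lower-cased; values are trimmed. */
    private static Map<String, String> normalize(Map<String, String> headers) {
        Map<String, String> normalized = new TreeMap<>();
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            normalized.put(entry.getKey().toLowerCase(Locale.ROOT), entry.getValue().trim());
        }
        return normalized;
    }

    private static String show(String value) {
        return value == null ? "<absent>" : value;
    }
}
```

An empty result only says the two requests look alike on the wire. It says nothing about what scripts running inside the page can observe, which is where the rest of this question points.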

I would conclude that there is something on Selenium’s end that is causing me to be identified as a bot, even when I’m using the browser as a regular user.

Perhaps there are specific JS objects that are injected into the DOM that Google picks up on?
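One way to test that hypothesis is to run the same small scripts in both browsers and compare the answers. The WebDriver specification defines a navigator.webdriver property that is meant to be true in a browser under automation. Whether reCAPTCHA actually reads it is an assumption here, not something I can confirm. The sketch below keeps the probe logic independent of Selenium by taking a "run this script" function. AutomationProbeSketch and PROBES are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: evaluate a few page-visible properties and collect the results. */
class AutomationProbeSketch {

    // Script bodies use an explicit "return", the form Selenium's JavascriptExecutor expects.
    static final Map<String, String> PROBES = new LinkedHashMap<>();
    static {
        // Defined by the WebDriver spec; that any detector relies on it is an assumption.
        PROBES.put("navigator.webdriver", "return navigator.webdriver === true;");
        // Included so the unspoofed user agent can be compared between the two browsers.
        PROBES.put("navigator.userAgent", "return navigator.userAgent;");
    }

    /** Runs every probe through the supplied script runner, preserving probe order. */
    static Map<String, Object> run(Function<String, Object> executeScript) {
        Map<String, Object> results = new LinkedHashMap<>();
        for (Map.Entry<String, String> probe : PROBES.entrySet()) {
            results.put(probe.getKey(), executeScript.apply(probe.getValue()));
        }
        return results;
    }
}
```

With a live Selenium session this could be called as `AutomationProbeSketch.run(js -> ((JavascriptExecutor) driver).executeScript(js))`. In the manually launched browser, the same expressions can be typed into the developer console without the `return`.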
