PhantomJsCloud Documentation

These docs are a work in progress. If you have any questions or comments, please let us know.

The PhantomJsCloud API is organized around a REST-like "JSON API" web service. You make a request by submitting a request.json payload describing your PageRequest, and we send back your results (as JPEG, or in another renderType format you specify in your UserRequest) along with an HTTP response code indicating any errors, and HTTP response headers that carry important metadata (page cost, etc).

JSON API

A JSON based HTTP Endpoint allowing full access to all PhantomJsCloud features.
Standard JSON API

This is the most direct way of interacting with PhantomJsCloud. If you are comfortable composing GET or POST HTTP-Requests directly in your language of choice, you can use our HTTP Endpoint.

Automation API NEW!

This new API allows full flexibility. Most importantly, it provides a simple and straightforward way to type on the keyboard, tap the screen, and click with the mouse. See the New Automation API Docs.

Node.js API

Access to the JSON API via a strongly typed Node Library. Includes autoscaling helpers and the new Automation API.
Node.js

If you use JavaScript or TypeScript, you can use our Official NPM Module. This can also be used in browsers via the Browserify and Webpack projects.

Other Language Examples

These examples show how to leverage the JSON API from various languages.
Guidelines when using these "Other Language" examples
Authentication

The examples use the demo ApiKey a-demo-key-with-low-quota-per-ip-address to make requests (Located in the "url line" section of the example). This demo key is limited to 100 requests per day. You can create a Free account to get 500 Pages/Day at Dashboard.PhantomJsCloud.com. Then when following these examples, replace the demo key with the ApiKey found on your account dashboard page. For example: https://PhantomJsCloud.com/api/browser/v2/ak-012345-abcde-012345-abcde-012345/
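As a rough illustration of where the ApiKey goes, here is a minimal Python sketch using only the standard library. The function names are hypothetical, and sending the payload as a JSON POST body with a Content-Type header is an assumption based on the POST examples elsewhere on this page. The request is only built here, not sent.

```python
import json
import urllib.request

# Demo key quoted in the text above; replace it with the ApiKey from your dashboard.
DEMO_KEY = "a-demo-key-with-low-quota-per-ip-address"
BASE_URL = "https://PhantomJsCloud.com/api/browser/v2/"


def endpoint_url(api_key: str = DEMO_KEY) -> str:
    """Return the endpoint URL with the ApiKey as the final path segment."""
    return f"{BASE_URL}{api_key}/"


def build_request(page_request: dict, api_key: str = DEMO_KEY) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the payload as JSON."""
    body = json.dumps(page_request).encode("utf-8")
    return urllib.request.Request(
        endpoint_url(api_key),
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


req = build_request({"url": "https://example.com", "renderType": "jpeg"})
print(req.full_url)
```

Passing req to urllib.request.urlopen would submit it; the response status code and headers are then available on the returned object.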

HTTP Response Status Codes and Headers

Most languages provide access to the response statusCode and headers. Please refer to the "Basic Troubleshooting" section (below) for descriptions of these.


Usage FAQ

These output formats (renderType values) are fully described in the HTTP Endpoint docs, and are summarized here:

  • plainText For Web Content Scraping. If you need a page's fully rendered DOM, simply saving the HTML source won't cut it. Use our REST API with this output format (the default) and scrape the resulting HTML as usual. The page's JavaScript will be fully executed, and all DOM transformations completed.
  • jpg, jpeg, png For visual inspection. If you need to generate page previews, archive screenshots, or create thumbnails, this renders the page and sends the result as JPEG or PNG.
  • pdf For Archiving and Reports. Create a PDF of the page or uploaded HTML, including all images, svg graphics, headers and footers.
  • html Returns the target page in its "native" form, with all response headers intact. Useful for generating static versions of your Single-Page-App / AJAX Data, or for proxied requests. Very useful for SEO of Facebook / Twitter / Yahoo / Bing web bots.
  • automation NEW For advanced users. To use this properly you must read the Automation API Docs. Allows unparalleled control over the browser:
    • Simulate Human Input: Keyboard, Mouse, Touchscreen
    • Multi-page Navigation: Login, follow dynamic links, etc.
    • Multiple Renders: Screenshot, PDF, HTML/text extraction, whenever and as many as you want.
    • Puppeteer Support: Most Pptr APIs are implemented in a secure ES2018 sandbox.
  • script DEPRECATED: please use "automation" instead. Manipulate and extract data from any webpage. Using the Script Injection feature, you can execute arbitrary JavaScript on any page, and can also specify exactly what data you want returned from your request. This is a powerful feature, even allowing you to construct custom API Endpoints! To use this properly, you need to read the HTTP Endpoint docs regarding Script injection

You have three choices for using proxies with PhantomJsCloud:

  1. Geolocation using Static IP: Some organizations need PhantomJsCloud requests to come from a static IP address, usually for security (whitelist) purposes.
  2. Anonymous Proxy: Using this feature will make each API Request use a different (anonymous) IP Address. This feature is optional, and using it costs an additional $0.50/GB of data ingress.
  3. Custom Proxy: If you want to use a 3rd party proxy, that still works normally.

Here are some examples using the three proxy types:


//POST request JSON payload to use a worldwide anonymous proxy
{ url:"https://phantomjscloud.com/examples/helpers/requestdata", proxy:"anon-any"}
//anonymous proxy from Netherlands
{ url:"https://phantomjscloud.com/examples/helpers/requestdata", proxy:"anon-nl"}
//static IP from USA (35.188.112.61)
{ url:"https://phantomjscloud.com/examples/helpers/requestdata", proxy:"geo-us"}
//use your custom 3rd party proxy
{ url:"https://phantomjscloud.com/examples/helpers/requestdata", proxy:"custom-http://myProxy.com:8838:myname:secret"}
                                

Please refer to the docs for additional proxy configuration details.
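The proxy strings in the examples above follow a regular pattern, which can be sketched in Python as below. The helper names are hypothetical, and the custom proxy layout (host, port, username, password separated by colons) is inferred only from the single example above, so check the proxy docs before relying on it.

```python
def anonymous_proxy(country: str = "any") -> str:
    """Anonymous proxy, e.g. "anon-any" (worldwide) or "anon-nl" (Netherlands)."""
    return f"anon-{country}"


def static_ip_proxy() -> str:
    """Static IP from the USA, as in the "geo-us" example."""
    return "geo-us"


def custom_proxy(host: str, port: int, username: str, password: str) -> str:
    """3rd party proxy, following the "custom-http://host:port:user:pass" example."""
    return f"custom-http://{host}:{port}:{username}:{password}"


def with_proxy(url: str, proxy: str) -> dict:
    """Return a request payload that routes the page load through the proxy."""
    return {"url": url, "proxy": proxy}


target = "https://phantomjscloud.com/examples/helpers/requestdata"
for proxy in (
    anonymous_proxy(),
    anonymous_proxy("nl"),
    static_ip_proxy(),
    custom_proxy("myProxy.com", 8838, "myname", "secret"),
):
    print(with_proxy(target, proxy))
```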

If you use our Node.js Client API Library, you can submit as many requests as you wish and your calls will gracefully be handled. This is because the Node.js Client API Library has built-in rate-limiting and autoscale handling. If you need to do batched requests with other languages, please refer to the Testing and Performance Optimization section below.

Once a resource is loaded, it is normally cached. This means that another page request that loads the same resource will not make a network request, and thus there will not be any load for us to record in the pageResponses.events section.

To force all resources to load, you should pass the pageRequest.requestSettings.clearCache:true parameter. This is also helpful if you are making changes to the resource and want to make sure the newest version is the one used by your call to PhantomJsCloud.

Below is an example of what you would see in the pageResponses.events section of your JSON response, if the resource being loaded is https://example.com/resource.css:

{
    "key": "resourceRequested",
    "time": "2016-05-24T15:36:50.376Z",
    "value": {
        "resourceRequest": {
            "headers": "OUTPUT SUPPRESSED (Disabled to reduce verbosity of your JSON.  You can enable by removing the related entry in your pageRequest.suppressJson settings)",
            "id": 172,
            "method": "GET",
            "time": "2016-05-24T15:36:50.376Z",
            "url": "https://example.com/resource.css"
        }
    }
},
{
    "key": "resourceReceived",
    "time": "2016-05-24T15:36:50.482Z",
    "value": {
        "resourceResponse": {
            "body": "",
            "bodySize": 3654,
            "contentType": "application/javascript",
            "headers": "OUTPUT SUPPRESSED (Disabled to reduce verbosity of your JSON.  You can enable by removing the related entry in your pageRequest.suppressJson settings)",
            "id": 172,
            "redirectURL": null,
            "stage": "start",
            "status": 200,
            "statusText": "OK",
            "time": "2016-05-24T15:36:50.482Z",
            "url": "https://example.com/resource.css"
        }
    }
},
{
    "key": "resourceReceived",
    "time": "2016-05-24T15:36:50.492Z",
    "value": {
        "resourceResponse": {
            "contentType": "application/javascript",
            "headers": "OUTPUT SUPPRESSED (Disabled to reduce verbosity of your JSON.  You can enable by removing the related entry in your pageRequest.suppressJson settings)",
            "id": 172,
            "redirectURL": null,
            "stage": "end",
            "status": 200,
            "statusText": "OK",
            "time": "2016-05-24T15:36:50.491Z",
            "url": "https://example.com/resource.css"
        }
    }
},
{
    "key": "resourceReceived",
    "time": "2016-05-24T15:36:50.492Z",
    "value": {
        "url": "https://example.com/resource.css",
        "status": 200
    }
},
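To check whether a resource was actually fetched over the network (rather than served from cache), you can scan the events list for entries like the ones above. The following Python sketch assumes you have already parsed the JSON response and extracted the events list; the function names are hypothetical, and the sample data is a trimmed copy of the events shown above.

```python
def was_requested(events: list, url: str) -> bool:
    """True if a resourceRequested event exists for the URL (i.e. it was not cached)."""
    for event in events:
        if event.get("key") != "resourceRequested":
            continue
        request = event.get("value", {}).get("resourceRequest", {})
        if request.get("url") == url:
            return True
    return False


def final_status_by_url(events: list) -> dict:
    """Map each URL to the status of its resourceReceived event with stage "end"."""
    statuses = {}
    for event in events:
        if event.get("key") != "resourceReceived":
            continue
        response = event.get("value", {}).get("resourceResponse")
        if response and response.get("stage") == "end":
            statuses[response["url"]] = response["status"]
    return statuses


events = [
    {"key": "resourceRequested",
     "value": {"resourceRequest": {"id": 172, "method": "GET",
                                   "url": "https://example.com/resource.css"}}},
    {"key": "resourceReceived",
     "value": {"resourceResponse": {"id": 172, "stage": "start", "status": 200,
                                    "url": "https://example.com/resource.css"}}},
    {"key": "resourceReceived",
     "value": {"resourceResponse": {"id": 172, "stage": "end", "status": 200,
                                    "url": "https://example.com/resource.css"}}},
]

print(was_requested(events, "https://example.com/resource.css"))
print(final_status_by_url(events))
```

If was_requested returns False for a resource you expected, the resource was probably served from cache, and clearCache:true should make it appear.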

If you need to debug a problem with your script, use the outputAsJson:true parameter, then search the output for the term browserError, which will be under pageResponse.events. This should point you to any syntax errors in your script.
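The search can also be done programmatically. The Python sketch below walks a parsed JSON response and collects every object whose "key" field is browserError. The function name is hypothetical, and the nesting and "value" contents of the sample response are invented for illustration; the walk is recursive precisely so it does not depend on the exact layout.

```python
def find_events(node, key: str = "browserError") -> list:
    """Recursively collect every dict in the parsed JSON whose "key" equals key."""
    found = []
    if isinstance(node, dict):
        if node.get("key") == key:
            found.append(node)
        for child in node.values():
            found.extend(find_events(child, key))
    elif isinstance(node, list):
        for child in node:
            found.extend(find_events(child, key))
    return found


# Hypothetical response shape, for illustration only.
response_json = {
    "pageResponses": [
        {
            "events": [
                {"key": "resourceRequested", "value": {}},
                {"key": "browserError", "value": {"message": "SyntaxError: Unexpected token"}},
            ]
        }
    ]
}

for error in find_events(response_json):
    print(error["value"])
```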


Advanced Automation Samples

These Samples show and explain how to use the overseerScript to perform advanced automation techniques. To use these properly you should understand:

  • The Basics: Please familiarize yourself with the JSON API Docs and the basic examples on the same page.
  • Modern JavaScript: The overseerScript executes in a secure ES2018 JavaScript Sandbox. At least be familiar with the await keyword (MDN docs here).
  • Automation API: Browse through the Automation API Docs and associated examples.
  • Chrome Inspector: Know how to use the Chrome Inspector to determine the querySelector string of an element. This is needed to instruct which element to click on, etc.

If you have a request for another scenario sample, please let us know!

We created an example button-click page to help illustrate this process. The following request.json uses the New Automation API to click the What is the time? button and wait for the demo_result element to become populated before taking a screenshot.

{
    "url":"https://phantomjscloud.com/static-samples/button-click.html",
    "renderType": "jpeg",
    "overseerScript":'page.manualWait(); await page.waitForSelector("button#dateBtn"); page.click("button#dateBtn"); await page.waitForFunction(()=>document.querySelector("#demo_result").textContent!==""); page.done();'
}

In the above request, we inject an overseerScript that:

  1. page.manualWait(); Informs the API to only complete when page.done() is called.
  2. await page.waitForSelector("button#dateBtn"); Waits until the button#dateBtn element is present in the HTML.
  3. page.click("button#dateBtn"); Clicks the button.
  4. await page.waitForFunction(()=>document.querySelector("#demo_result").textContent!==""); Waits for the #demo_result element to contain non-empty text.
  5. page.done(); Signals to the API that we are done and the render can occur (renderType:"jpeg" in the request JSON).
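If you generate the request from another language, quoting the overseerScript by hand is error-prone. One possible approach, sketched here in Python, is to keep the steps above as a list of statements and let a JSON serializer handle the escaping. The variable names are illustrative; the statements themselves are copied from the request above.

```python
import json

# One entry per step in the walkthrough above.
steps = [
    "page.manualWait();",
    'await page.waitForSelector("button#dateBtn");',
    'page.click("button#dateBtn");',
    'await page.waitForFunction(()=>document.querySelector("#demo_result").textContent!=="");',
    "page.done();",
]

page_request = {
    "url": "https://phantomjscloud.com/static-samples/button-click.html",
    "renderType": "jpeg",
    "overseerScript": " ".join(steps),
}

# json.dumps escapes the inner double quotes, producing strictly valid JSON.
payload = json.dumps(page_request, indent=4)
print(payload)
```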

Please see the New Automation API docs for more details on this powerful automation workflow, and also for more examples. Also, let us know if you have any questions or need help.

We created an example button-click-navigate page to help illustrate this process. The following request.json will click the Nav to another page button and wait for the navigation to take place, and then perform the same workflow as the prior "Page Automation: How can I click a button" Advanced Scenario Sample: click the dateBtn button and wait for the demo_result element to become populated before taking a screenshot.

{
    "url": "https://PhantomJsCloud.com/static-samples/button-click-navigate.html",
    "renderType": "jpeg",
    "overseerScript":'await page.waitForSelector("button#navBtn"); page.click("button#navBtn"); await page.waitForNavigation();',
}

In the above request, we inject an overseerScript that:

  1. await page.waitForSelector("button#navBtn"); Waits until the button#navBtn element is present in the HTML.
  2. page.click("button#navBtn"); Clicks the button.
  3. await page.waitForNavigation(); Waits for a page navigation to occur (a side effect of clicking the button in step #2).

Contrast this sample with the preceding "Page Automation: How can I click a button on a page and wait for it to update the page?" example. It is very similar, except that it does not use the page.manualWait(); or page.done(); calls. This is because calling page.done(); is only required if you need to wait past the execution of the overseerScript (such as if there were a setInterval() asynchronous call).

Please see the New Automation API docs for more details on this powerful automation workflow, and also for more examples. Also, let us know if you have any questions or need help.

Building off what you learned in the above "How can I load a page, navigate to another..." sample, here is an example request.json that will log in to LinkedIn and capture a screenshot of your home page:

{
    "url": "https://www.linkedin.com/uas/login",
    "renderType": "jpeg",
    "overseerScript":'let _user="USER@EXAMPLE.COM"; let _pass="PASSWORD"; await page.waitForSelector("input#username"); await page.type("input#username",_user,{delay:50}); await page.type("input#password",_pass,{delay:50}); page.click("button[type=submit]"); await page.waitForNavigation();',
}

In the above request, we inject an overseerScript that:

  1. let _user="USER@EXAMPLE.COM"; let _pass="PASSWORD"; Populates the username/password you will use to log in.
  2. await page.waitForSelector("input#username"); Waits until the "input#username" element is present in the HTML.
  3. await page.type("input#username",_user,{delay:50}); await page.type("input#password",_pass,{delay:50}); Types the username/password into their respective input elements slowly, like a human would.
  4. page.click("button[type=submit]"); Clicks the Submit button.
  5. await page.waitForNavigation(); Waits for a page navigation to occur (a side effect of clicking the button in step #4).
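When the credentials come from your own program rather than being typed into the script, they need to be embedded as properly escaped JavaScript string literals (a password containing a quote would otherwise break the script). A minimal Python sketch, assuming the same selectors as the request above; the function name is hypothetical, and json.dumps is used only because its ASCII-escaped output is also a valid JavaScript string literal.

```python
import json


def login_script(user: str, password: str) -> str:
    """Build the overseerScript from the steps above with escaped credentials."""
    user_js = json.dumps(user)      # e.g. "USER@EXAMPLE.COM" including the quotes
    pass_js = json.dumps(password)  # inner quotes and backslashes are escaped
    return (
        f"let _user={user_js}; let _pass={pass_js}; "
        'await page.waitForSelector("input#username"); '
        'await page.type("input#username",_user,{delay:50}); '
        'await page.type("input#password",_pass,{delay:50}); '
        'page.click("button[type=submit]"); '
        "await page.waitForNavigation();"
    )


page_request = {
    "url": "https://www.linkedin.com/uas/login",
    "renderType": "jpeg",
    "overseerScript": login_script("USER@EXAMPLE.COM", "PASSWORD"),
}
print(json.dumps(page_request, indent=4))
```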

Please see the New Automation API docs for more details on this powerful automation workflow, and also for more examples. Also, let us know if you have any questions or need help.



Basic Troubleshooting

502 Bad Gateway Errors

If you are getting 502 Bad Gateway errors frequently, be sure that the ExpectContinue header is set to false. Some platforms (C# and Curl) set this to true by default, so be sure to change it!

If you get 502 errors frequently and this does not solve your problem, please let us know.

Request takes a long time

By default PhantomJsCloud waits for your target page to finish loading. If a page has a lot of AJAX (ads, lazy content, etc) it could take a long time. To make the page finish faster (and thus your API call complete faster) you can try finishing at the page DomContentLoaded event, in one of these two ways:

  • Automation: add overseerScript:'await page.waitForNavigation("domcontentloaded"); page.done()' to your request.json. Read more about this technique here
  • RequestSettings: add requestSettings:{doneWhen:[{event:"domReady"}]} to your request.json. Read more about this technique here

Of the above two methods, we suggest the Automation technique as it allows more flexibility (access to the entire Automation API), such as adding page.waitForSelector("input#someId") to ensure a certain DOM element exists.
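For reference, the two techniques translate into request payloads along these lines. This Python sketch only builds the payload dictionaries; the function names are hypothetical, and the overseerScript and requestSettings values are copied from the two bullets above.

```python
def finish_early_with_automation(url: str) -> dict:
    """Automation technique: finish once DomContentLoaded has fired."""
    return {
        "url": url,
        "overseerScript": 'await page.waitForNavigation("domcontentloaded"); page.done()',
    }


def finish_early_with_settings(url: str) -> dict:
    """RequestSettings technique: finish at the domReady event."""
    return {
        "url": url,
        "requestSettings": {"doneWhen": [{"event": "domReady"}]},
    }


print(finish_early_with_automation("https://example.com"))
print(finish_early_with_settings("https://example.com"))
```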

Debugging Page Errors: Status Codes

When processing the results you receive from PhantomJsCloud, be sure you pay attention to the two types of statusCode results. Be aware that these two statusCodes have separate meanings.

  • Response StatusCode: The HTTP StatusCode returned from PhantomJsCloud will normally be 200 unless there was a problem processing the request. For example: if the target server is offline or if the request is invalid. If there is a timeout requesting the target URL a 424 Failed Dependency error will be returned. When a Response Failure is sent to you, we try to provide useful data in the statusCode_Help parameter.

    Here is a general description of the Response Status Codes we send and what they mean.

    • 200: OK The target page was captured properly.
    • 400: Bad Request Your request had an error in it. Fix it before resubmitting.
    • 401: Unauthorized You are using an invalid Api Key. Please check for typos, or create an account.
    • 402: Payment Required Your account is out of credits. Login and either upgrade your Subscription or add Prepaid Credits.
    • 403: Forbidden Your request was flagged due to abuse. Read the response for steps you should take to resolve the situation.
    • 424: Failed Dependency The target page was not reachable (the request timed out). Check and make sure your target URL is valid before retrying, or make sure your requestSettings.maxWait parameter is set to be long enough.

      Extra Info: The 424 error is returned when the primary page URL does not load. Some reasons this could occur:

      • Proxy Failure: If using a proxy server and the proxy does not respond.
      • Target Blacklists Request: Some hosts use anti-bot systems that will drop connections instead of replying with a proper error code.

      We just return 424 to inform you that *something* didn't finish loading. If you need more details on what that something was, use the outputAsJson=true parameter and look at the pageResponse.events node, which will show a timeline of sub-resources (request and response).

    • 429: Too Many Simultaneous Requests You sent a sudden spike of simultaneous requests. PhantomJsCloud can handle hundreds of simultaneous requests, but we require you to gracefully increase the number of concurrent requests over time, not send a sudden spike. Please increase the number of your simultaneous requests according to the schedule shown in the 'Testing and Performance Optimization' section of the docs page. (add +1 simultaneous requests every 3 seconds, or +10 simultaneous every 30 seconds). You may retry this request immediately, with no modifications.
    • 500: Internal Server Error The PhantomJsCloud instance suffered an internal error. You can retry your request immediately, without modifications. If errors still occur, these are the known causes:
      1. More time needed, retry with larger pageRequest.requestSettings.maxWait value.
      2. An incompatible webfont is causing PhantomJs to crash. Try blacklisting any font resources (.otf, .ttf, .woff), for example:
        pageRequest.requestSettings.resourceModifier:[{regex:'.*ttf.*|.*otf.*|.*woff.*',isBlacklisted:true}]
      If you still have problems, please submit your request to Support@PhantomJsCloud.com for diagnosis.
    • 502: Bad Gateway Your request did not reach PhantomJsCloud due to a network failure. You can retry your request immediately, without modifications. If errors still occur, see the "502 Bad Gateway" Troubleshooting item above.
    • 503: Server Too Busy The server is temporarily overwhelmed with other requests, and its request backlog is very large. We are returning this to you to prevent the risk of an HTTP timeout occurring instead. Support@PhantomJsCloud.com has been notified and will investigate. You may retry this request immediately, with no modifications.

  • Content StatusCode: When we retrieve the target URL, we store its statusCode for you to inspect. This is available via content.statusCode. Be aware that while the Content is in an error state, the PhantomJsCloud Response HTTP StatusCode is still 200 (valid).
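The status code descriptions above can be condensed into a small decision helper. This is only a sketch in Python: the action labels and function names are made up for illustration, while the grouping (which codes may be retried without modification, which need a change on your side) follows the descriptions in the list.

```python
# Codes the list above says can be retried without modifying the request.
RETRYABLE = {429, 500, 502, 503}

# Codes that need action on your side before resubmitting.
ACTIONS = {
    200: "ok",
    400: "fix_request",
    401: "check_api_key",
    402: "add_credits",
    403: "read_response",
    424: "check_target_url_or_max_wait",
}


def next_action(response_status: int) -> str:
    """Suggest what to do with a PhantomJsCloud Response StatusCode."""
    if response_status in RETRYABLE:
        return "retry"
    return ACTIONS.get(response_status, "unknown")


def target_page_failed(response_status: int, content_status: int) -> bool:
    """True when the API call succeeded (200) but the target page returned an error."""
    return response_status == 200 and content_status >= 400


print(next_action(503))
print(target_page_failed(200, 404))
```

Note that a 429 can be retried right away, but it also signals that you should slow down how quickly you add parallel requests.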
HTTP Response Headers

If you obtain results in JSON format, a great deal of useful metadata is returned. If you return your captured results in a different format (PDF, JPEG, HTML, etc), we provide the most important of this metadata in the form of HTTP Response Headers.

  • pjsc-billing-credit-cost: The total cost of this capture.
  • pjsc-billing-daily-subscription-credits-remaining: The number of Daily Subscription Credits your account has remaining.
  • pjsc-billing-prepaid-credits-remaining: The number of Prepaid Credits your account has remaining.
  • pjsc-billing-total-credits-remaining: The total number of Credits your account has remaining (Subscription + Prepaid)
  • pjsc-content-name: If you were to save the response payload as a file, this is a suggested name for the file. Example: content.jpeg
  • pjsc-content-status-code: See the "Debugging Page Errors: Status Codes" section above for a description of "Content StatusCode"
  • pjsc-content-url: The final URL (after redirects) that was captured and returned to you.
  • pjsc-backend-id: The PhantomJsCloud instance that handled your request. Provided for support (debug) purposes.
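A small Python sketch of collecting these headers from a response. The function name is hypothetical, the sample header values are invented, and treating the credit headers as decimal numbers is an assumption; values that do not parse as numbers are kept as raw text.

```python
from typing import Mapping

CREDIT_HEADERS = (
    "pjsc-billing-credit-cost",
    "pjsc-billing-daily-subscription-credits-remaining",
    "pjsc-billing-prepaid-credits-remaining",
    "pjsc-billing-total-credits-remaining",
)


def read_pjsc_headers(headers: Mapping[str, str]) -> dict:
    """Collect the pjsc-* headers, converting numeric ones where possible."""
    metadata: dict = {}
    for name, value in headers.items():
        key = name.lower()  # HTTP header names are case-insensitive
        if not key.startswith("pjsc-"):
            continue
        try:
            if key in CREDIT_HEADERS:
                metadata[key] = float(value)  # assumed to be numeric
            elif key == "pjsc-content-status-code":
                metadata[key] = int(value)
            else:
                metadata[key] = value
        except ValueError:
            metadata[key] = value  # keep the raw text if it is not numeric
    return metadata


# Invented sample values, for illustration only.
sample = {
    "Content-Type": "image/jpeg",
    "pjsc-billing-credit-cost": "0.2",
    "pjsc-content-status-code": "200",
    "pjsc-content-name": "content.jpeg",
}
meta = read_pjsc_headers(sample)
print(meta)
```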

If you feel additional metadata would be useful if returned as part of the response headers, please let us know...



Geolocation

Geolocation (Static IP and Random IP)

Geolocation lets your requests come from a specific geographic location. We currently support two forms of geolocation:

  1. Static IP from the USA
  2. Random IP address from a chosen country (more than 12 choices)

Here are example request JSON payloads showing both forms:


//anonymous proxy from Netherlands
{ url:"https://phantomjscloud.com/examples/helpers/requestdata", proxy:"anon-nl"}
//static IP from USA
{ url:"https://phantomjscloud.com/examples/helpers/requestdata", proxy:"geo-us"}

Both forms of Geolocation are performed via our proxy solution. Please read the proxy docs for more details on how to perform Geolocation.

For an updated list of countries we support for Random IP locations, click here. Here is the list as of June 2019:


any	"Worldwide (Global)"
au	"Australia"
br	"Brazil"
cn	"China"
de	"Germany"
es	"Spain"
fr	"France"
gb	"Great Britain"
in	"India"
jp	"Japan"
nl	"Netherlands"
sg	"Singapore"
th	"Thailand"
us	"United States"
                            


Testing and Performance Optimization

Rate-limiting Requests

Using Node.js? If you use the official PhantomJsCloud Node.js API Client Library you do not need to rate-limit requests. Autoscaling is handled automatically.

PhantomJsCloud automatically scales its capacity based on demand, but it still takes a few seconds for the additional capacity to come online when there are spikes in demand. To ensure a graceful capacity ramp-up, please follow these guidelines:

  1. Start with 10 Parallel Requests
  2. Add +1 Parallel requests every 3 seconds
  3. If you get an HTTP error code 429 or 503, delay adding additional parallel requests for 45 seconds.
  4. If your number of parallel requests falls under your "max" for more than 60 seconds, lower your max to your current number (so if you need more parallel requests later, add +1 every 3 seconds as per #2).
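One way to implement these four rules outside of Node.js is a small limiter object that your request loop consults. The Python sketch below is an interpretation, not an official algorithm: the class and method names are hypothetical, the limit is only raised while you are actually using all of it (an assumption that keeps rules 2 and 4 from fighting each other), and the limit never drops below 1. Time is passed in explicitly so the logic is easy to test.

```python
class RampUpLimiter:
    START = 10            # rule 1: start with 10 parallel requests
    STEP_SECONDS = 3      # rule 2: +1 parallel request every 3 seconds
    BACKOFF_SECONDS = 45  # rule 3: pause increases after a 429 or 503
    IDLE_SECONDS = 60     # rule 4: shrink after 60 seconds below the max

    def __init__(self, now: float = 0.0) -> None:
        self.max_parallel = self.START
        self._next_increase = now + self.STEP_SECONDS
        self._below_since = None  # time at which usage first dropped below max

    def on_throttled(self, now: float) -> None:
        """Call when a response has status 429 or 503."""
        self._next_increase = now + self.BACKOFF_SECONDS

    def update(self, now: float, in_flight: int) -> int:
        """Call regularly with the current time and number of running requests."""
        if in_flight >= self.max_parallel:
            self._below_since = None
            if now >= self._next_increase:
                self.max_parallel += 1
                self._next_increase = now + self.STEP_SECONDS
        elif self._below_since is None:
            self._below_since = now
        elif now - self._below_since > self.IDLE_SECONDS:
            self.max_parallel = max(in_flight, 1)  # assumed floor of 1
            self._below_since = now
        return self.max_parallel


limiter = RampUpLimiter(now=0.0)
print(limiter.update(now=3.0, in_flight=10))
```

In a real client you would call update() with time.monotonic() before starting each new request, and only start one while the number of running requests is below the returned value.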
Optimizing your requests

Wait Interval: By default we set pageRequest.requestSettings.waitInterval=1000 (1 second). This padding allows waiting for AJAX or CSS animations before rendering. However, if you know your page does not require this wait interval, setting waitInterval=0 will reduce render time (and price) by 1 second.