PhantomJs Cloud Documentation

These docs are a work in progress. If you have any questions or comments, please let us know.

The PhantomJs Cloud API is organized around a REST-like "HTTP Endpoint" web service. You make a request by submitting a request.json payload describing your UserRequest. We send back your results (as JPEG, or in another renderType format you specify in your UserRequest), along with an HTTP response code indicating any errors and HTTP response headers carrying important metadata (page cost, etc.).

API Client Libraries

These API Client Libraries fully support all PhantomJs Cloud features.

HTTP Endpoint

This is the most direct way of interacting with PhantomJs Cloud. If you are comfortable composing GET or POST HTTP-Requests directly in your language of choice, you can use our HTTP Endpoint.

Node.js

If you use JavaScript or TypeScript, you can use our Official NPM Module. This can also be used in browsers via the Browserify and Webpack projects.

Other Language Examples

These examples show how to leverage the HTTP Endpoint from various languages.

Guidelines when using these "Other Language" examples

Authentication

The examples use the demo ApiKey a-demo-key-with-low-quota-per-ip-address to make requests (located in the "url line" section of each example). This demo key is limited to 100 requests per day. You can create a Free account to get 500 Pages/Day at Dashboard.PhantomJsCloud.com. Then, when following these examples, replace the demo key with the ApiKey found on your account dashboard page. For example: https://PhantomJsCloud.com/api/browser/v2/ak-012345-abcde-012345-abcde-012345/
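
As a rough illustration of the URL pattern above, the following Python sketch builds the endpoint URL for a given ApiKey and wraps a request.json payload in a POST request without sending it. The helper names endpoint_url and build_post are hypothetical, and the Content-Type header is an assumption rather than a documented requirement.

```python
import json
import urllib.request

# Demo key quoted in the text above; replace it with your own ApiKey.
DEMO_KEY = "a-demo-key-with-low-quota-per-ip-address"


def endpoint_url(api_key: str = DEMO_KEY) -> str:
    """Return the HTTP Endpoint URL for the given ApiKey."""
    return f"https://PhantomJsCloud.com/api/browser/v2/{api_key}/"


def build_post(page_request: dict, api_key: str = DEMO_KEY) -> urllib.request.Request:
    """Wrap a request.json payload in a POST request (nothing is sent here)."""
    body = json.dumps(page_request).encode("utf-8")
    return urllib.request.Request(
        endpoint_url(api_key),
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


req = build_post({"url": "https://example.com", "renderType": "jpeg"})
print(req.get_method(), req.full_url)
```

Passing the prepared request to urllib.request.urlopen would actually submit it and consume quota, which is why the sketch stops short of that step.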

HTTP Response Status Codes and Headers

Most languages provide access to the response statusCode and headers. Please refer to the "Basic Troubleshooting" section (below) for descriptions of these.


Usage FAQ

These are fully described in the HTTP Endpoint docs, but are pretty-formatted here:

  • plainText For Web Content Scraping. If you need a page's fully rendered DOM, simply saving the HTML source won't cut it. Use our REST API with this output format (the default) and scrape the resulting HTML as usual. The page's JavaScript will be fully executed, and all DOM transformations completed.
  • jpg, jpeg, png For visual inspection. If you need to generate page previews, archive screenshots, or create thumbnails, this renders the page and sends the result as JPEG or PNG.
  • pdf For Archiving and Reports. Create a PDF of the page or uploaded HTML, including all images, SVG graphics, headers and footers.
  • html Returns the target page in its "native" form, with all response headers intact. Useful for generating static versions of your Single-Page-App / AJAX Data, or for proxied requests. Very useful for SEO of Facebook / Twitter / Yahoo / Bing web bots.
  • script Manipulate and extract data from any webpage. Using the Script Injection feature, you can execute arbitrary JavaScript on any page, and can also specify exactly what data you want returned from your request. This is a powerful feature, even allowing you to construct custom API Endpoints! To use this properly, you need to read the HTTP Endpoint docs regarding Script Injection.

Currently we have limited proxy functionality. If you have your own 3rd party proxy provider, you can specify the proxy in your userRequest settings when making a request to our HTTP Endpoint. If your proxy provider needs to Whitelist an IP Address, you can use api-static.phantomjscloud.com. We will be greatly improving the proxy options in the next few weeks, and we will send an announcement mail when the new proxy option is ready for use.

This is a temporary solution for proxy server needs. We will be developing a better solution in the coming months.

Use http://api-static.phantomjscloud.com when you need to Whitelist our IP Addresses (and with Proxy Servers). However, keep in mind that static IP addresses do not auto-provision, so this endpoint should not be used for bulk requests. Contact Support if you need a private static instance.

If you use our Node.js Client API Library, you can submit as many requests as you wish and your calls will gracefully be handled. This is because the Node.js Client API Library has built-in rate-limiting and autoscale handling. If you need to do batched requests with other languages, please refer to the Testing and Performance Optimization section below.

Once a resource is loaded, it is normally cached. This means that another page request that loads the same resource will not make a network request, and thus there will not be any load for us to record in the pageResponses.events section.

To force all resources to load, you should pass the pageRequest.requestSettings.clearCache:true parameter. This is also helpful if you are making changes to the resource and want to make sure the newest version is the one used by your call to PhantomJs Cloud.
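
A minimal sketch of adding this setting to a request payload in Python is shown here. The helper with_request_settings is a hypothetical name. It assumes requestSettings sits at the top level of request.json, as in the Advanced Scenario Samples.

```python
import copy
import json


def with_request_settings(page_request: dict, **settings) -> dict:
    """Return a copy of page_request with extra requestSettings merged in."""
    merged = copy.deepcopy(page_request)
    merged.setdefault("requestSettings", {}).update(settings)
    return merged


base = {"url": "https://example.com", "renderType": "jpeg"}
fresh = with_request_settings(base, clearCache=True)  # force resources to reload
print(json.dumps(fresh, indent=2))
```

Working on a copy leaves the base payload untouched, so the same base can be reused for requests that should keep using the cache.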

Below is an example of what you would see in the pageResponses.events section of your JSON response, if the resource being loaded is https://example.com/resource.css:

{
"key": "resourceRequested",
"time": "2016-05-24T15:36:50.376Z",
"value": {
"resourceRequest": {
"headers": "OUTPUT SUPPRESSED (Disabled to reduce verbosity of your JSON.  You can enable by removing the related entry in your pageRequest.suppressJson settings)",
"id": 172,
"method": "GET",
"time": "2016-05-24T15:36:50.376Z",
"url": "https://example.com/resource.css"
}
}
},
{
"key": "resourceReceived",
"time": "2016-05-24T15:36:50.482Z",
"value": {
"resourceResponse": {
"body": "",
"bodySize": 3654,
"contentType": "application/javascript",
"headers": "OUTPUT SUPPRESSED (Disabled to reduce verbosity of your JSON.  You can enable by removing the related entry in your pageRequest.suppressJson settings)",
"id": 172,
"redirectURL": null,
"stage": "start",
"status": 200,
"statusText": "OK",
"time": "2016-05-24T15:36:50.482Z",
"url": "https://example.com/resource.css"
}
}
},
{
"key": "resourceReceived",
"time": "2016-05-24T15:36:50.492Z",
"value": {
"resourceResponse": {
"contentType": "application/javascript",
"headers": "OUTPUT SUPPRESSED (Disabled to reduce verbosity of your JSON.  You can enable by removing the related entry in your pageRequest.suppressJson settings)",
"id": 172,
"redirectURL": null,
"stage": "end",
"status": 200,
"statusText": "OK",
"time": "2016-05-24T15:36:50.491Z",
"url": "https://example.com/resource.css"
}
}
},
{
"key": "resourceReceived",
"time": "2016-05-24T15:36:50.492Z",
"value": {
"url": "https://example.com/resource.css",
"status": 200
}
},
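
If you want to check programmatically which resources finished loading, a sketch along these lines could work. It assumes you have already extracted the events list from the JSON response. The function name finished_resources is hypothetical, and the sample data is trimmed from the output above.

```python
def finished_resources(events: list) -> list:
    """Return (url, status) for every resourceReceived event at stage 'end'."""
    finished = []
    for event in events:
        if event.get("key") != "resourceReceived":
            continue
        # Events without a resourceResponse object are skipped.
        response = event.get("value", {}).get("resourceResponse", {})
        if response.get("stage") == "end":
            finished.append((response.get("url"), response.get("status")))
    return finished


sample_events = [
    {"key": "resourceRequested",
     "value": {"resourceRequest": {"id": 172, "url": "https://example.com/resource.css"}}},
    {"key": "resourceReceived",
     "value": {"resourceResponse": {"id": 172, "stage": "start", "status": 200,
                                    "url": "https://example.com/resource.css"}}},
    {"key": "resourceReceived",
     "value": {"resourceResponse": {"id": 172, "stage": "end", "status": 200,
                                    "url": "https://example.com/resource.css"}}},
]
print(finished_resources(sample_events))
```

An empty result for a resource you expected to see is a hint that it was served from cache.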

If you need to debug a problem with your script, use the outputAsJson:true parameter, then search the output for the term browserError, which will be under pageResponse.events. This should give you an idea of any syntax errors your script may have caused.
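
The search described above can be sketched in Python as a plain text match over each event. The exact shape of an error event is not documented here, so the sample events below are hypothetical and the function find_browser_errors is only an illustration.

```python
import json


def find_browser_errors(events: list) -> list:
    """Return the events whose serialized form mentions 'browserError'."""
    return [e for e in events if "browserError" in json.dumps(e)]


# Hypothetical events; the real shape of an error event may differ.
events = [
    {"key": "resourceReceived", "value": {"status": 200}},
    {"key": "browserError", "value": {"message": "SyntaxError: Unexpected token"}},
]
for event in find_browser_errors(events):
    print(event)
```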


Advanced Scenario Samples

These Advanced Scenario Samples show and explain the request.json you would submit to PhantomJs Cloud via your language of choice. Please familiarize yourself with the Basic Examples for your language of choice (above), and the HTTP Endpoint Docs before going through these!

If you have a request for another scenario sample, please let us know!

We created an example button-click page to help illustrate this process. The following request.json will click the What is the time? button and wait for the demo_result element to become populated before taking a screenshot.

{
        "url": "https://PhantomJsCloud.com/static-samples/button-click.html",
        "renderType": "jpeg",
        "scripts": {
                "domReady": [],
                "loadFinished": [
                    "https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js",
                    "$('button#dateBtn').click();",
                    "_pjscMeta.manualWait=true;",
                    "setInterval(function(){ if($('#demo_result').text()!==''){_pjscMeta.manualWait=false;} },200)"
                    ]
        }
}

In the above request, we inject a series of scripts that:

  1. Load jQuery
  2. Click the button
  3. Set _pjscMeta.manualWait to true, so the page is not captured until it is set back to false
  4. Wait for the demo_result element to be populated and then signal we can capture the page.
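
When the request.json is assembled in code rather than by hand, quoting and trailing-comma mistakes are easier to avoid. The Python sketch below builds the same kind of payload. The function name click_and_wait_request is hypothetical, and the sketch treats "populated" as "has non-empty text".

```python
import json

JQUERY = "https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js"


def click_and_wait_request(url: str, button: str, result: str) -> dict:
    """Build a request that clicks `button`, then waits for `result` to fill."""
    # json.dumps turns each selector into a safely quoted JavaScript string.
    poll = (
        "setInterval(function(){ if($(%s).text()!==''){"
        "_pjscMeta.manualWait=false;} },200)" % json.dumps(result)
    )
    return {
        "url": url,
        "renderType": "jpeg",
        "scripts": {
            "domReady": [],
            "loadFinished": [
                JQUERY,                                  # 1. load jQuery
                "$(%s).click();" % json.dumps(button),   # 2. click the button
                "_pjscMeta.manualWait=true;",            # 3. hold the capture
                poll,                                    # 4. release when filled
            ],
        },
    }


request_json = click_and_wait_request(
    "https://PhantomJsCloud.com/static-samples/button-click.html",
    "button#dateBtn",
    "#demo_result",
)
print(json.dumps(request_json, indent=2))
```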

Alternatively, we could choose to return the contents of the demo_result element after it is populated by the following request.json:

{
        "url": "https://PhantomJsCloud.com/static-samples/button-click.html",
        "renderType": "script",
        "scripts": {
                "domReady": [],
                "loadFinished": [
                    "https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js",
                    "$('button#dateBtn').click();",
                    "setInterval(function(){ var txt=$('#demo_result').text(); if(txt!==''){_pjscMeta.scriptOutput=txt;} },200)"
                    ]
        }
}

In the above request, we don't need to explicitly set manualWait, because when you choose the script renderType, PhantomJs Cloud will automatically delay capture until the _pjscMeta.scriptOutput variable is set. (This could happen by you explicitly setting that variable, or by returning a value from your injected script.)

We created an example button-click-navigate page to help illustrate this process. The following request.json will click the Nav to another page button and wait for the navigation to take place, then perform the same workflow as the prior "Page Automation: How can I click a button" Advanced Scenario Sample: click the dateBtn button and wait for the demo_result element to become populated before taking a screenshot.

{
        "url": "https://PhantomJsCloud.com/static-samples/button-click-navigate.html",
        "renderType": "jpeg",
        "requestSettings":{
            "maxWait":25000,
            "clearCache":true
        },
        "scripts": {
                "domReady": [],
                "loadFinished": [
                    "https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js",
                    "if(location.pathname.indexOf('button-click-navigate.html')>0){ $('button#navBtn').click(); }else{ $('button#dateBtn').click(); }",
                    "_pjscMeta.manualWait=true;",
                    "setInterval(function(){ if($('#demo_result').text()!==''){_pjscMeta.manualWait=false;} },200)"
                    ]
        }
}

In the above request, we inject a series of scripts that:

  1. Load jQuery
  2. Click the navBtn if we are on the button-click-navigate.html page, and dateBtn otherwise.
  3. Set _pjscMeta.manualWait to true, so the page is not captured until it is set back to false
  4. Wait for the demo_result element to be populated and then signal we can capture the page.

This workflow relies on the fact that when a new page is navigated to, all the pageRequest.scripts are re-triggered. This means that our scripts can perform different functionality based on what page we are on or what cookies are loaded.

Also be aware that for the button-click-navigate.html page, _pjscMeta.manualWait is never set back to false. When we navigate to a new page, all of the previous page's local state and script variables are lost and ignored.
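
Because the same scripts run again on every page, the per-page branching can be generated rather than written by hand. The Python sketch below reproduces the branching line from the request above. The helper page_dispatch_script is a hypothetical name, and it assumes none of its arguments break the surrounding JavaScript quoting.

```python
def page_dispatch_script(path_fragment: str, on_match: str, otherwise: str) -> str:
    """Return JavaScript that runs on_match on pages whose path contains
    path_fragment, and otherwise on every other page.

    path_fragment must not contain a single quote.
    """
    return (
        "if(location.pathname.indexOf('%s')>0){ %s }else{ %s }"
        % (path_fragment, on_match, otherwise)
    )


script = page_dispatch_script(
    "button-click-navigate.html",
    "$('button#navBtn').click();",
    "$('button#dateBtn').click();",
)
print(script)
```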

Building off what you learned in the above "How can I load a page, navigate to another..." sample, here is an example request.json that will log in to LinkedIn and capture a screenshot of your home page:

{
        "url": "https://www.linkedin.com/uas/login",
        "renderType": "jpeg",
        "scripts": {
                "domReady": [
                    "https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.4/jquery.min.js",  
                    "if(location.pathname==='/uas/login'){ _pjscMeta.manualWait=true; $('#session_key-login')[0].value='USER@EXAMPLE.COM'; $('#session_password-login')[0].value='PASSWORD'; $('#login')[0].submit(); }"
                    ]
        }
}

As you can see, it is very easy to do practical automation with PhantomJs Cloud's Script Injection. If you are a developer looking to create your own automation script, I recommend using Chrome to construct your steps (you can load external scripts from the Chrome Dev Tools console, for example with jQuery's $.getScript()) and then copying them to your PhantomJs Cloud scripts without any modifications.



Basic Troubleshooting

502 Bad Gateway Errors

If you are getting 502 Bad Gateway errors frequently, be sure that the ExpectContinue header is set to false. Some platforms (C# and Curl) set this to true by default, so be sure to change it!

If you get 502 errors frequently and this does not solve your problem, please let us know.

Debugging Page Errors: Status Codes

When processing the results you receive from PhantomJs Cloud, be sure you pay attention to the two types of statusCode results. Be aware that these two statusCodes have separate meanings.

  • Response StatusCode: The HTTP StatusCode returned from PhantomJs Cloud will normally be 200 unless there was a problem processing the request. For example: if the target server is offline or if the request is invalid. If there is a timeout requesting the target URL a 424 Failed Dependency error will be returned. When a Response Failure is sent to you, we try to provide useful data in the statusCode_Help parameter.

    Here is a general description of the Response Status Codes we send and what they mean.

    • 200: OK The target page was captured properly.
    • 400: Bad Request Your request had an error in it. Fix it before resubmitting.
    • 401: Unauthorized You are using an invalid Api Key. Please check for typos, or create an account.
    • 402: Payment Required Your account is out of credits. Login and either upgrade your Subscription or add Prepaid Credits.
    • 403: Forbidden Your request was flagged due to abuse. Read the response for steps you should take to resolve the situation.
    • 424: Failed Dependency The target page was not reachable (the request timed out). Check and make sure your target URL is valid before retrying, or make sure your requestSettings.maxWait parameter is set to be long enough.
    • 500: Internal Server Error The PhantomJs Cloud instance suffered an internal error. You can retry your request immediately, without modifications. If errors still occur, these are the known causes:
      1. More time needed, retry with larger pageRequest.requestSettings.maxWait value.
      2. An incompatible webfont is causing PhantomJs to crash. Try blacklisting any font resources (.otf, .ttf, .woff), for example:
        pageRequest.requestSettings.resourceModifier:[{regex:'.*ttf.*|.*otf.*|.*woff.*',isBlacklisted:true}]
      If you still have problems, please submit your request to Support@PhantomJsCloud for diagnosis.
    • 502: Bad Gateway Your request did not reach PhantomJs Cloud due to a network failure. You can retry your request immediately, without modifications. If errors still occur, see the "502 Bad Gateway" Troubleshooting item above.

  • Content StatusCode: When we retrieve the target URL, we store its statusCode for you to inspect. This is available via content.statusCode. Be aware that even when the Content is in an error state, the PhantomJs Cloud Response HTTP StatusCode is still 200 (valid).
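
The two status codes can be handled separately in client code. The Python sketch below encodes the retry guidance from the list above. The names RESPONSE_STATUS, should_retry, and capture_ok are hypothetical, and treating only 2xx Content StatusCodes as success is an assumption.

```python
# Response StatusCode -> (meaning, whether an unmodified retry is suggested above)
RESPONSE_STATUS = {
    200: ("OK", False),
    400: ("Bad Request", False),
    401: ("Unauthorized", False),
    402: ("Payment Required", False),
    403: ("Forbidden", False),
    424: ("Failed Dependency", False),
    500: ("Internal Server Error", True),
    502: ("Bad Gateway", True),
}


def should_retry(response_status: int) -> bool:
    """True when the list above says the request can be retried unmodified."""
    return RESPONSE_STATUS.get(response_status, ("Unknown", False))[1]


def capture_ok(response_status: int, content_status: int) -> bool:
    """A capture is good only if both status codes report success.

    Assumption: a Content StatusCode in the 2xx range counts as success.
    """
    return response_status == 200 and 200 <= content_status < 300


print(should_retry(502), capture_ok(200, 404))
```

Keeping the two checks apart makes it clear that a 200 response from PhantomJs Cloud can still wrap a target page that returned an error.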

HTTP Response Headers

If you obtain results in JSON format, a great deal of useful metadata is returned. If you return your captured results in a different format (PDF, JPEG, HTML, etc.), we provide the most important of this metadata in the form of HTTP Response Headers.

  • pjsc-billing-credit-cost: The total cost of this capture.
  • pjsc-billing-daily-subscription-credits-remaining: The number of Daily Subscription Credits your account has remaining.
  • pjsc-billing-prepaid-credits-remaining: The number of Prepaid Credits your account has remaining.
  • pjsc-billing-total-credits-remaining: The total number of Credits your account has remaining (Subscription + Prepaid)
  • pjsc-content-name: If you were to save the response payload as a file, this is a suggested name for the file. Example: content.jpeg
  • pjsc-content-status-code: See the "Debugging Page Errors: Status Codes" section above for a description of "Content StatusCode"
  • pjsc-content-url: The final URL (after redirects) that was captured and returned to you.
  • pjsc-backend-id: The PhantomJs Cloud instance that handled your request. Provided for support (debug) purposes.
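
A short Python sketch of reading the billing headers listed above follows. It assumes the values arrive as numeric strings, which the text does not state, and the example header values are made up for illustration. The function name billing_summary is hypothetical.

```python
def billing_summary(headers: dict) -> dict:
    """Collect the pjsc-billing-* headers as floats, keyed by short name."""
    prefix = "pjsc-billing-"
    summary = {}
    for name, value in headers.items():
        lowered = name.lower()  # HTTP header names are case-insensitive
        if lowered.startswith(prefix):
            summary[lowered[len(prefix):]] = float(value)  # assumes numeric text
    return summary


# Hypothetical header values, for illustration only.
example_headers = {
    "pjsc-billing-credit-cost": "0.2",
    "pjsc-billing-total-credits-remaining": "499.8",
    "pjsc-content-status-code": "200",
}
print(billing_summary(example_headers))
```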

If you feel additional metadata would be useful if returned as part of the response headers, please let us know.



Testing and Performance Optimization

Rate-limiting Requests

Using Node.js? If you use the official PhantomJs Cloud Node.js API Client Library you do not need to rate-limit requests. Autoscaling is handled automatically.

PhantomJs Cloud automatically scales its capacity based on demand, but it still takes a few seconds for additional capacity to come online when there are spikes in demand. To ensure a graceful capacity ramp-up, please follow this guideline:

Add a maximum of +1 simultaneous request every 2 seconds.

Examples:

  • after 1 second: 1 simultaneous request
  • after 2 seconds: 2 simultaneous requests
  • after 3 seconds: 2 simultaneous requests
  • after 4 seconds: 3 simultaneous requests
  • after 10 seconds: 6 simultaneous requests
  • after 60 seconds: 31 simultaneous requests
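
The examples above are consistent with a simple closed form: start at 1 request and add 1 for every full 2 seconds elapsed. The Python sketch below expresses that. The formula is inferred from the listed examples rather than stated by the service, and max_simultaneous is a hypothetical name.

```python
def max_simultaneous(seconds_elapsed: int) -> int:
    """Allowed simultaneous requests: start at 1, add 1 every 2 seconds."""
    return 1 + seconds_elapsed // 2


for t in (1, 2, 3, 4, 10, 60):
    print(t, max_simultaneous(t))
```

A client that is not using the Node.js library could consult such a function before launching each new request and hold back whenever the number in flight has reached the limit.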

Simulating out-of-credits

You can use the test ApiKey ak-test-key-no-credits to simulate what it will be like when your account is out of credits. You will receive an HTTP Response of 402, such as in this example request.

Optimizing your requests

Image Quality: By default we set image quality to 70, which is used for JPG and PDF rendering. This offers a reasonable trade-off for file size vs image quality. We suggest that you do not use PNG output as the file size is very large.

Wait Interval: By default we set pageRequest.requestSettings.waitInterval=1000 (1 second). This padding allows waiting for AJAX or CSS animations before rendering. However, if you know your page does not require this wait interval, setting waitInterval=0 will reduce render time (and price) by 1 second.

Latest News

  • 20160522: Scripting improvements
    • Changed script execution to occur serially, which now allows including dependencies prior to your code.
    • Added _pjscMeta.manualWait for scripts to specify when to capture (render) the page
  • 20160505: Node.js API Client Library added. Also adding other language-specific samples.
  • 20160504: Added userResponse.originalRequest, which shows the original input values without defaults applied.
  • 20160416: requestSettings.clearCache improvements and better invalid request error messages.
  • 20160405: Improved and documented the two types of StatusCodes. see the "Debugging Page Errors: Status Codes" section above.
    • This means you will now get 4xx errors for failed requests, which you did not before.
  • 20160401: Api Endpoints now support HTTPS requests.
  • 20160328: Changed static ip server to api-static.phantomjscloud.com
  • 20160326: Billing system now linked, so rate-limiting has been removed. (Process as many requests as fast as you want)
  • 20160325: Added Billing metadata to HTTP Response Headers.
  • 20160320: Added error details when userRequest JSON is invalid.
  • 20160107: Fix PDF Viewport size (too zoomed in) and PDF viewing in Chrome.
    • For now, it's best to set PDF dimensions in pixels (px) via the renderSettings.pdfOptions.width and .height parameters.
  • 20151217: Add Proxy Server support. See UserRequest.proxy for details. Removed Geolocation hack.
  • 20151216: Added Endpoints with Static IP Addresses for users who need to Whitelist our IP Addresses for their use (and with Proxy Servers).
  • 20151126: Allow passthrough of response headers via pageRequest.renderSettings.passThroughHeaders:true
  • 20151119: CORS and JSONP support. Improve injectedScript performance.
  • 20151023: Add Script injection and metadata. See PageRequest.scripts for details. Also added lots more examples.
  • 20151022: Add custom Cookie and additional Header support:
    • pageRequest.requestSettings.clearCookies:boolean //clear cookies prior to loading
    • pageRequest.requestSettings.cookies:[] //set cookies for any domain
    • pageRequest.requestSettings.customHeaders:{} //set headers sent with all requests
  • 20151020: Add per-page browser cache control:
    • pageRequest.requestSettings.clearCache:boolean //clear cache prior to loading

Roadmap

  • get running on a massively parallel cloud backend (Complete as of 20150901)
  • v1 feature parity (Complete as of 20151023)
  • new creditBalance/billing system (Complete as of 20160326)
  • new website / dashboard. (Complete as of 20160321)
  • Spider + Batch api system.
  • Improved Proxy service to support proxy + geolocation needs.
  • SDK / API for popular languages