I was recently tasked with rewriting all the information from a SharePoint Online site library into a document bible. I searched the web and, to my surprise, couldn’t find anything that would automatically convert or export all the SharePoint site pages into doc files. Copying every single page by hand, with a total of 150 pages, is a painful process! Dealing with the styles that get carried over and losing time hunting down every single page is not fun at all... So, since I wasn’t a site collection administrator, I decided to build my own PowerShell script to achieve this.
In this article, we will be using the CSOM API and the REST service to achieve the desired solution. Why are we using two services for this script? Because CSOM will help us collect all the pages from a site, and REST will help us scrape the HTML code, which we will convert to DOC later.
First, you will need to install the Client Components SDK before starting anything. Its client classes are very useful for providing the credentials and associating them with the requests we are going to make: https://www.microsoft.com/en-us/download/details.aspx?id=35585
Define the site URL in a variable:
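As a minimal sketch, the variable could look like this. The URL is a placeholder that follows the example address format used later in this article, and `$siteUrl` is a name picked for illustration:

```powershell
# Hypothetical site URL; replace it with the site that holds your pages.
$siteUrl = "https://some-cloud-id.sharepoint.com/sites/site_name"
```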
Once done, in the first part of our PowerShell script, we will add the credentials to use with the site URL we are crawling:
The script loads the .NET assemblies into our PowerShell session so we can make use of the ClientContext and SharePointOnlineCredentials objects. The credentials are not sent in plain text but as a security token stored in a cookie, and I assume this is what everyone wants: fast, simple and secure. I’m sure the code can be shortened a lot more, but I wanted to show you exactly how the process works in PowerShell.
Once the credentials are entered using Get-Credential, they are passed as constructor arguments to a new instance of the SharePointOnlineCredentials class.
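A minimal sketch of this step might look like the following. The SDK install path is the usual default for the SharePoint 2013 Client Components SDK and is an assumption, as are the variable names; `$siteUrl` is the variable defined in the previous step.

```powershell
# Load the CSOM assemblies installed by the Client Components SDK.
# Assumption: default install folder of the SharePoint 2013 SDK; adjust if yours differs.
$sdkPath = "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\15\ISAPI"
Add-Type -Path (Join-Path $sdkPath "Microsoft.SharePoint.Client.dll")
Add-Type -Path (Join-Path $sdkPath "Microsoft.SharePoint.Client.Runtime.dll")

# Prompt for the account, then wrap user name and SecureString password
# in a SharePointOnlineCredentials object.
$login = Get-Credential
$credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($login.UserName, $login.Password)

# Create the client context for the site and attach the credentials to it.
$context = New-Object Microsoft.SharePoint.Client.ClientContext($siteUrl)
$context.Credentials = $credentials
```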
Now we will load the site in the context and return a message for successful connection.
To do this, we will create a new variable $web and load the site into the context:
$context.Load() is similar to SP.ClientContext.load from the website article, and you can define more properties to retrieve from the server.
To confirm a successful connection to the server, we will use the same $context and execute a query, then check whether the ServerObject value of $context is null; if it is not, we display a message that the server connection was successful:
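Assuming `$context` is the ClientContext created in the credentials step, this can be sketched as:

```powershell
# Queue the site (Web) object for retrieval; nothing is sent to the server
# until ExecuteQuery() is called on the context.
$web = $context.Web
$context.Load($web)
```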
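One possible way to write this check is sketched below. It is an assumption about the shape of the original snippet: it tests the `ServerObjectIsNull` property on `$web` rather than on the context itself, and wraps the call in `try`/`catch` so that a failed login is reported as well.

```powershell
try {
    # Sends the queued Load() request to the server.
    $context.ExecuteQuery()

    # ServerObjectIsNull is a nullable boolean on CSOM client objects.
    if ($web.ServerObjectIsNull -ne $true) {
        Write-Host "Server connection successful: $($web.Title)" -ForegroundColor Green
    }
}
catch {
    Write-Host "Server connection failed: $($_.Exception.Message)" -ForegroundColor Red
}
```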
Awesome aye? 😉
Now, we want to get a list of the pages we need and store it in an array. We will need this so we can crawl the HTML code of each page using a loop.
In my case, since a library is defined not as a site but as a list, with pages as its items, we need to define these objects and pass them to the context. A better way of doing this is to create a function and loop through all the items:
The function first gets the list by its title parameter, which is the library that holds all our pages. It then creates a query with CreateAllItemsQuery, stores it in a variable, and has the context load and execute it to return the items.
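A function along these lines could look like the sketch below. The name `Get-ListItems` and its parameter names are chosen for illustration; the CSOM calls (`GetByTitle`, `CamlQuery.CreateAllItemsQuery`, `GetItems`) are standard, but this is not necessarily the exact function from the original script.

```powershell
function Get-ListItems {
    param(
        $Context,              # a connected Microsoft.SharePoint.Client.ClientContext
        [string] $ListTitle    # title of the list or library to read
    )

    # Get the list by title, then build a query that returns every item in it.
    $list  = $Context.Web.Lists.GetByTitle($ListTitle)
    $query = [Microsoft.SharePoint.Client.CamlQuery]::CreateAllItemsQuery()
    $items = $list.GetItems($query)

    # Ask the server for the items and return them to the caller.
    $Context.Load($items)
    $Context.ExecuteQuery()
    return $items
}
```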
We are now ready to use this function for our loop and store each item (site page) into an array:
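The sketch below assumes the hypothetical `Get-ListItems` function from the previous step and a library titled "Site Pages" (replace it with your own library title). Instead of two nested loops it reads the `FileLeafRef` field, which holds the file name, directly from each item's field values; treat that shortcut as an assumption rather than the original code.

```powershell
# Collect the file name of every page in the library.
$pages = @()
$items = Get-ListItems -Context $context -ListTitle "Site Pages"

foreach ($item in $items) {
    # FieldValues is a dictionary of field name -> value for the item.
    $pages += $item.FieldValues["FileLeafRef"]
}

# The CSOM connection is no longer needed after this point.
$context.Dispose()
```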
Simple, I say: the first loop goes through the properties of each item and the second goes through the values of those properties, which gives us the page file name (the item name). Once done, we dispose of the context connection, as we will not be using it anymore.
Note: you can find more properties to return in the Microsoft article.
Now that we have a list of all the pages we want to download, let’s get them downloaded!
For this part, we will be using the REST API service. This is useful because the ASP.NET server will not allow us to copy the .aspx files themselves, but we can copy the HTML code that is rendered.
REST, or Representational State Transfer, uses HTTP verbs to perform basic create, read, update and delete (CRUD) operations against a web service endpoint. RESTful web services based on the OData standard can return data in either XML or JSON format. SharePoint’s REST endpoint is located under the https://[…]/_api/ virtual directory of a given site. If your URL looks different when using SharePoint Online, don’t worry: as long as the site name is specified at the end, it will still work. Example: https://some-cloud-id.sharepoint.com/sites/*site_name*
From here it is very simple: we declare a loop and use System.Net.WebRequest to retrieve the HTML output.
The page file name variable will need to be edited, because the web request does not accept spaces and will otherwise return an error:
The data variable contains the HTML output, which we then write to a file, either HTML or DOC. What I have done is output it to an HTML file and then use another loop to iterate through all the HTML files and convert them properly to .docx. You can use this script from GitHub to achieve this if you change a couple of lines and then use Get-ChildItem to loop through all the HTML files.
That’s all done. We can now use this same script for all our sites and download all the pages in bulk, without copying them manually.
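To make the endpoint layout concrete, here is a small sketch that composes a REST URL for the items of a library. The site URL is the example address from this article, and the library title "Site Pages" is an assumption.

```powershell
# Example site address from this article; the REST service lives under /_api/.
$siteUrl  = "https://some-cloud-id.sharepoint.com/sites/site_name"
$endpoint = "$siteUrl/_api/web/lists/getbytitle('Site Pages')/items"

# Sending this Accept header asks SharePoint for JSON instead of the default XML.
$headers = @{ "Accept" = "application/json;odata=verbose" }
```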
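A sketch of this loop could look as follows. `ConvertTo-PageUrl` is a hypothetical helper, the "SitePages" library path is an assumption, and `$pages`, `$siteUrl` and `$credentials` are the variables from the earlier steps. Spaces are handled with `[System.Uri]::EscapeDataString`, which percent-encodes them as `%20`.

```powershell
function ConvertTo-PageUrl {
    param([string] $SiteUrl, [string] $LibraryPath, [string] $FileName)
    # Percent-encode the file name so spaces become %20.
    $SiteUrl.TrimEnd('/') + '/' + $LibraryPath + '/' + [System.Uri]::EscapeDataString($FileName)
}

foreach ($page in $pages) {
    $url = ConvertTo-PageUrl -SiteUrl $siteUrl -LibraryPath "SitePages" -FileName $page

    # Request the rendered page, authenticating with the SharePoint Online credentials.
    $request = [System.Net.WebRequest]::Create($url)
    $request.Credentials = $credentials
    $request.Headers.Add("X-FORMS_BASED_AUTH_ACCEPTED", "f")

    # Read the whole response body; $data holds the HTML of the page.
    $response = $request.GetResponse()
    $reader   = New-Object System.IO.StreamReader($response.GetResponseStream())
    $data     = $reader.ReadToEnd()
    $reader.Close()
    $response.Close()
}
```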
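The file handling around that conversion can be sketched as below. The output folder and the `Get-DocxPath` helper are hypothetical, and the actual HTML-to-.docx conversion is left to the conversion script mentioned above, so it appears only as a placeholder comment.

```powershell
function Get-DocxPath {
    param([string] $HtmlPath)
    # Same folder and file name, with the extension swapped to .docx.
    [System.IO.Path]::ChangeExtension($HtmlPath, ".docx")
}

# Hypothetical output folder; inside the download loop each page would be saved with e.g.
#   Set-Content -Path (Join-Path $outDir ($page -replace '\.aspx$', '.html')) -Value $data -Encoding UTF8
$outDir = "C:\Temp\SitePages"

if (Test-Path $outDir) {
    Get-ChildItem -Path $outDir -Filter *.html | ForEach-Object {
        $target = Get-DocxPath -HtmlPath $_.FullName
        Write-Host "Converting $($_.Name) to $(Split-Path $target -Leaf)"
        # Call the HTML-to-.docx conversion script here with $_.FullName and $target.
    }
}
```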