Excel (OneDrive, Office365)¤
The dataset needs the URL of a “share via link” sheet on Office 365/OneDrive as input. It will automatically construct a direct download URL, cache the download file handle it like an XLSX file in the Excel Dataset.
Notes¤
There are 2 types of URLs that can be shared:
Onedrive links look like https://1drv.ms/x/s!AucULvzmJ-dsdfsfgaIcyWP_XY_G4w?e=yx65uu
Onedrive (based one sharepoint, for businesses) links look like https://eccencagmbh-my.sharepoint.com/:x:/g/personal/person_eccenca_com/EdEMTEw1dclHiEZXyvy8P4YBit8wSyGsiwU5Kt__sQOZzw
The first type should always work is not recommended for this dataset. The second type requires to set up an application in Microsoft EntraID (formerly Azure Active Directory). EntraID: https://docs.microsoft.com/azure/active-directory/develop/v2-overview Instructions and examples can be found here: https://github.com/Azure-Samples/ms-identity-msal-java-samples/tree/main/3-java-servlet-web-app/1-Authentication/sign-in
After following the steps access to sharepoint/onedrive for business can be setup in the application.conf file for eccenca DataIntegration.
Example:
com.eccenca.di.office365 = {
authority = "https://login.microsoftonline.com/a0907dd1-f981-4c98-a8b9-1deb27bcf2cc/"
clientId = "4d14959d-3c62-4f90-a072-a96ca4b3fa9f"
secret = "Ceb8Q~QkMMV7TBK-ggB3nh22nUnqoDB1KTmkjj"
scope = "https://graph.microsoft.com/.default"
tenantId = "a0907dd1-f981-4c98-a8b9-1deb27bcf2cc"
}
Caching¤
The advanced parameter invalidateCacheAfter
allows the user to specify a duration of the file cache
after which it is refreshed.
A file based cache is created to avoid CAPTCHAs. During the caching and validation of the URL
access occurs with random wait times between 1 and 5 seconds.
The cache is invalidated after 5 minutes by default.
Parameter¤
URL¤
Link to the document (‘share with anyone having a link’ must be enabled).
- Datatype:
string
- Default Value:
None
Streaming¤
Streaming enables reading and writing large Excels files. Warning: Be careful to disable streaming for large datasets (> 10MB), because of high memory consumption.
- Datatype:
boolean
- Default Value:
true
Invalidate cache after¤
Duration until file based cache is invalidated.
- Datatype:
duration
- Default Value:
PT5M
Lines to skip¤
The number of lines to skip in the beginning when reading files.
- Datatype:
int
- Default Value:
0