While the Microsoft documentation is quite extensive and covers nearly any question you might have about Knowledge Mining with Azure Cognitive Search, we thought it would be helpful to collect some of the more salient questions into a single article for easy reference.
Over the past few months, I’ve been on what seems like a nonstop whirlwind of Knowledge Mining activity. And whether it’s delivering projects using Azure Cognitive Search, speaking with clients about the potential for Knowledge Mining, or hitting the road to give briefings and technical training with Microsoft, many of the same questions seem to come up. I find that while the power of Knowledge Mining is immediately evident, the specifics can be a little difficult to pin down when you’re first getting started. With this in mind, I figured it would be worthwhile to pull together a list of common questions and answers to help others get their arms around what Knowledge Mining in Azure might look like for them.
So, here are “12 Common Questions about Knowledge Mining and Azure Cognitive Search.”
The blob indexer can extract text from the following document formats:
Azure Cognitive Search limits how much text it extracts depending on the pricing tier:
If your indexer runs out of characters before it runs out of content, a warning is included in the indexer status response in the Azure Portal identifying documents that are partially indexed for this reason.
It is possible to index data in formats other than documents/files.
(Source: https://docs.microsoft.com/en-us/azure/search/search-indexer-overview )
You can upload your data directly to an Azure Cognitive Search index using a data “push”.
This technique uses the REST API to POST data directly into the index. It is a flexible approach that supports any JSON-formatted data; however, since it updates the index directly, your data will not benefit from the cognitive skills that an indexer would normally apply.
(Source: https://docs.microsoft.com/en-us/azure/search/search-what-is-data-import)
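To make the “push” model concrete, here is a minimal Python sketch that builds the request for the documents endpoint of the Search REST API. The service name, index name, and API version are placeholders; sending the payload with your admin api-key header completes the push.

```python
import json

# Placeholders: substitute your own service, index, and API version.
SERVICE = "my-search-service"
INDEX = "my-index"
API_VERSION = "2020-06-30"

def build_push_request(documents):
    """Build the URL and JSON body for a direct 'push' into the index.
    Each document is tagged with an @search.action of 'upload', which
    inserts the document or replaces it if its key already exists."""
    url = (f"https://{SERVICE}.search.windows.net/indexes/{INDEX}"
           f"/docs/index?api-version={API_VERSION}")
    body = {"value": [{**doc, "@search.action": "upload"} for doc in documents]}
    return url, json.dumps(body)

url, payload = build_push_request([{"id": "1", "content": "Quarterly report text"}])
print(url)
```

Note that because this bypasses the indexer, no skillset runs against the pushed content; any enrichment has to happen before you build the payload.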
Yes, you can. The simplest way to do this is to build a service (e.g. using Azure Functions) that reads the custom document type and stores its contents as JSON data (e.g. in Blob Storage), or pushes it into a supported Azure database platform.
Following this approach lets you leverage existing libraries in any of the languages Azure Functions supports.
(Source: https://docs.microsoft.com/en-us/azure/azure-functions/supported-languages)
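As a sketch of the idea, here is the core logic such a service might run. The “key: value” custom format and the function name are hypothetical; a real Azure Function would receive the file via a blob trigger and write the JSON back to Blob Storage for the indexer to pick up.

```python
import json

def crack_custom_document(raw_text: str) -> str:
    """Parse a hypothetical 'key: value' custom format and emit JSON
    that a blob indexer can ingest as a plain JSON document."""
    record = {}
    for line in raw_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
    return json.dumps(record)

print(crack_custom_document("client: Contoso\nstatus: active"))
```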
Azure Functions can help minimize your Azure usage costs, since your function runs only when needed and doesn’t use any of the premium cognitive or other services that often drive cost in Knowledge Mining implementations.
Because it’s a flexible, multi-language environment, you can leverage existing libraries for file types that Cognitive Search doesn’t support out of the box (e.g. engineering or CAD files, media files, etc.). Since you control the “cracking” process, you can make it as flexible as you see fit. For instance, it allows you to handle a group of files as a single “entity” in the index. Say you had all the information on a client in a single folder. You could write Azure Function code that reads the files in that folder and writes their contents out as a single JSON file. Azure Cognitive Search could then ingest that one file and let your users search against it. The contents of such a file could even come from multiple places, including other services: you could start with the contents of a folder, as above, and augment that information with data from your CRM.
The sky really is the limit.
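The folder-as-one-entity idea above can be sketched in Python. The field names (“id”, “files”) are hypothetical; the point is simply that one JSON document can carry the contents of many source files.

```python
import json
import os
import tempfile

def folder_to_document(folder: str) -> dict:
    """Combine every file in a folder into a single index entry, keyed by
    the folder name and carrying each file's contents."""
    doc = {"id": os.path.basename(folder), "files": {}}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as fh:
                doc["files"][name] = fh.read()
    return doc

# Demo with a throwaway folder standing in for a client's document set.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "notes.txt"), "w", encoding="utf-8") as fh:
        fh.write("Met with client on Tuesday.")
    print(json.dumps(folder_to_document(folder)))
```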
(Sources: https://docs.microsoft.com/en-us/azure/search/cognitive-search-working-with-skillsets and https://docs.microsoft.com/en-us/azure/search/cognitive-search-defining-skillset)
Yes, you can.
These are attributes of the index and can be controlled through the Azure Portal UI, or by creating or updating your index via HTTP POST or PUT against the REST API. One tool we find very useful for this is Postman (https://www.postman.com).
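For example, a minimal index definition with per-field attributes could be built like this (index and field names are placeholders; the resulting JSON would be sent via HTTP PUT to the index’s REST endpoint):

```python
import json

def build_index_definition(name: str) -> str:
    """Build a minimal index definition where attributes such as
    'searchable' and 'filterable' are set per field."""
    fields = [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String",
         "searchable": True, "filterable": False},
        {"name": "category", "type": "Edm.String",
         "searchable": False, "filterable": True},
    ]
    return json.dumps({"name": name, "fields": fields})

print(build_index_definition("my-index"))
```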
Useful links:
Yes and no. Currently there is no out-of-the-box support for these environments; however, we have custom code that can help with this need. We have also spoken with the Microsoft product team, and they are aware that people would love to have this feature. You never know: they may already be working on it…
Contact us if you’d like to discuss how to build such a solution or stay tuned for any announcements from Microsoft should they decide to add this feature.
Yes, you can.
Keep in mind that the security model behind this service is still in flux, so there may soon be simpler ways to do this. At the same time, Microsoft has to be careful when making deep changes to the underlying technology (i.e. Apache Lucene) because they don’t want to impact performance, stability, or the correctness of results.
The current approach to securely hiding select results from users uses native filtering to “trim” results from the returned set. While this technique has some challenges (e.g. updating permissions is a bit of a pain), it should have little to no impact on your app’s performance.
How it works: you add a field to each entry in the index identifying the “Principals” that are allowed to view it. This allows you to implement strict filtering against the current user’s group/role membership(s).
For instance: say you have two types of users, “public” and “private”. You tag each entry in the index with one or the other of these principals. When a user submits a search, your back-end service pulls the current user’s session and grabs their group membership information. The groups of which the user is a member are then added as filters on the query. This prevents any “private” search results from being returned to someone who isn’t a member of the “private” group (or who isn’t logged in at all).
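A sketch of building such a filter in Python follows. The `group_ids` field name is an assumption (it would be whatever filterable collection field you add to each index entry); the `search.in` inside `any` pattern is the one Microsoft documents for security trimming.

```python
def build_security_filter(user_groups):
    """Build an OData filter that trims results to entries tagged with at
    least one of the user's groups. 'group_ids' is a hypothetical
    filterable collection field on each index entry."""
    if not user_groups:
        user_groups = ["public"]  # anonymous users only see public entries
    joined = ",".join(user_groups)
    return f"group_ids/any(g: search.in(g, '{joined}', ','))"

print(build_security_filter(["public", "private"]))
```

The resulting string is passed as the `$filter` on the search query by your back-end service, so the trimming happens server-side before results ever reach the user.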
(Source: https://docs.microsoft.com/en-us/azure/search/search-security-trimming-for-azure-search)
When pricing out a possible Azure Search implementation, you’re going to have to look at the following:
Some principles to keep in mind, from our experience:
References:
We recommend that you start with a representative subset of your documents. That means grabbing files that represent the range of file and content types you would have in a production rollout. Configure indexing until the index contains the information you want, then track the Azure cost of indexing that entire sample set.
You can then multiply that cost out to estimate the cost for your whole document set. This will, of course, be an estimate, but it should give you an “order of magnitude” idea of the final cost.
Next, determine how much “churn” you will have on your document set. How often will you run the incremental indexing, and how many (i.e. what percentage) of your documents will have been changed, deleted, created? This should allow you to roughly calculate your incremental indexing costs.
Add the initial indexing cost to the incremental cost multiplied by the period you’re estimating, and that should give you a rough idea of the total indexing cost for that period.
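The arithmetic above can be laid out as a small worked example. All of the numbers here are made up; substitute the sample cost you actually measured and your own corpus size and churn rate.

```python
# Worked example with made-up numbers: scale a measured sample cost to the
# full corpus, then add monthly incremental re-indexing for changed docs.
sample_docs = 1_000
sample_cost = 25.0          # dollars to index the sample (hypothetical)
total_docs = 200_000

cost_per_doc = sample_cost / sample_docs            # 0.025
initial_cost = cost_per_doc * total_docs            # 5000.0

monthly_churn = 0.05        # 5% of documents change each month
months = 12
incremental_cost = cost_per_doc * total_docs * monthly_churn * months

total_estimate = initial_cost + incremental_cost
print(total_estimate)
```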
References:
The maximum number of Lucene documents is roughly 25 billion per index, so for most practical purposes there is no limit on the number of documents allowed.
The maximum document size when calling an Index API is approximately 16 megabytes.
Some limits vary by Azure region. To confirm whether any apply to your implementation, call the Get Service Statistics API within your target region; it will return information on any limits that apply.
(Source: https://docs.microsoft.com/en-ca/rest/api/searchservice/get-service-statistics)
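Calling that endpoint is a single authenticated GET. The sketch below just builds the URL (service name and API version are placeholders); the response JSON includes a usage section and a limits section you can compare against each other.

```python
# Placeholder service name; the Get Service Statistics endpoint reports
# current usage against this service's limits (document count, storage, etc.).
SERVICE = "my-search-service"
API_VERSION = "2020-06-30"

stats_url = (f"https://{SERVICE}.search.windows.net/servicestats"
             f"?api-version={API_VERSION}")
print(stats_url)
# A GET against stats_url with an 'api-key' header returns the statistics.
```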
There is a “complexity” limit that applies to composite elements, though this is rarely an issue. Per Microsoft: indexers enforce an upper limit of 3,000 elements across all complex collections per document, starting with the generally available API version “2019-05-06”. That means, for instance, that if a Zip file is indexed, it should contain fewer than 3,000 items.
(Source: https://docs.microsoft.com/en-us/azure/search/search-limits-quotas-capacity )
This toolset can do amazing things, but power comes at a cost. Before committing to a full production implementation of Azure Cognitive Search, you’re likely going to have to determine the cost/benefit relationship.
That said, there are many ways to shave off wasted or unnecessary costs. Start by following these basic tricks:
There are no one-size-fits-all solutions here. Please reach out to us and we can help you customize your Azure Cognitive Search solution to fit your user and organizational needs, as well as your pocketbook.
(Source: https://azure.microsoft.com/en-us/services/storage/files/)
To learn more about MNP’s Knowledge Mining solutions, contact us today.