Fixing Crawling Issues in Sitecore Search: Handling Special Characters in Document IDs

Today, I will describe an issue I faced in Sitecore Search while crawling a URL containing multiple JSON nodes.

I created a source of type “API Crawler” to crawl JSON nodes from a specific URL. However, during the crawling process, I encountered the following issue:

Some nodes were not crawled due to an error.

To investigate, I ran a query to retrieve all the crawled results and compared them with the JSON data from the URL. This helped me find the missing nodes. The next step was to determine why these nodes were not being crawled.
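The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the exact procedure I ran: the function name find_missing_ids, the sample nodes, and the list of crawled IDs are all hypothetical stand-ins for the JSON feed and the query results.

```python
def find_missing_ids(source_nodes, crawled_ids, id_key="Id"):
    """Return IDs present in the source JSON but absent from the crawled results."""
    crawled = set(crawled_ids)
    # Keep the source order so the missing nodes are easy to locate in the feed.
    return [node[id_key] for node in source_nodes if node[id_key] not in crawled]


# Hypothetical sample data standing in for the JSON feed and the query results.
source_nodes = [
    {"Id": "BE-HDVS41", "Name": "Node A"},
    {"Id": "BE-HDVS41.43", "Name": "Node B"},
]
crawled_ids = ["BE-HDVS41"]

missing = find_missing_ids(source_nodes, crawled_ids)
print(missing)
```

With this sample data, the only ID reported as missing is the one containing a dot.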

I discovered that the property I had mapped to the document ID, the “Id” attribute of the entity, contained special characters such as “.” (e.g., “BE-HDVS41.43”).

For a document to be crawled successfully, its document ID must match the regex pattern ^[a-zA-Z\d_-]+$, which means it may contain only letters, digits, underscores, and hyphens.
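To check IDs against this pattern before crawling, a small validation helper is enough. The sketch below is an assumption-laden example in Python: the helper name is_valid_document_id is made up, and the re.ASCII flag and fullmatch are my own choices to keep \d limited to 0-9 and to reject a trailing newline.

```python
import re

# Pattern quoted above; re.ASCII restricts \d to the digits 0-9.
DOC_ID_PATTERN = re.compile(r"^[a-zA-Z\d_-]+$", re.ASCII)


def is_valid_document_id(doc_id):
    """Return True if doc_id contains only letters, digits, '_' and '-'."""
    # fullmatch also rejects a trailing newline, which '$' alone would allow.
    return DOC_ID_PATTERN.fullmatch(doc_id) is not None


# Hypothetical IDs: the first is the failing example, the second a cleaned variant.
for candidate in ["BE-HDVS41.43", "BE-HDVS41-43"]:
    print(candidate, is_valid_document_id(candidate))
```

Running a check like this over the feed points directly at the IDs that would be rejected.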

After removing the characters that did not match this regex, all the items were successfully crawled.
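One way to remove the offending characters is a simple substitution. This is a sketch in Python and not necessarily how the cleanup has to be done in your setup; sanitize_document_id is a hypothetical helper name. Keep in mind that stripping characters can make two different IDs identical, so it is worth checking for duplicates afterwards.

```python
import re

# Matches every character NOT allowed by ^[a-zA-Z\d_-]+$ (ASCII digits only).
DISALLOWED_CHARS = re.compile(r"[^a-zA-Z\d_-]", re.ASCII)


def sanitize_document_id(doc_id):
    """Strip every character that the document ID pattern does not allow."""
    return DISALLOWED_CHARS.sub("", doc_id)


# Hypothetical usage with the failing ID from this post.
cleaned = sanitize_document_id("BE-HDVS41.43")
print(cleaned)
```

Replacing the characters with "_" or "-" instead of deleting them is an alternative that lowers the chance of two IDs collapsing into one.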

I hope this helps someone.
