I am reaching out about data extraction with DoxCycle. I noticed that an update was pushed last August, so I tested it on a 2024 file and a 2025 file. I still saw issues with extraction accuracy, and on some slips no data was extracted at all. Do you have any suggestions for improving the data extraction (other than the quality of the document, which we have limited control over)?
Also, for the OCR updates:
Are they layout/template fixes, or has any AI/layout‑aware OCR been added (e.g., models that “understand” form structure)?
If not, are there plans to incorporate more advanced document AI / layout‑based recognition (for example, what Microsoft’s Form Recognizer, Google Document AI, or models like LayoutLM do)?
Are these updates planned for release before the next tax season?
Data extraction accuracy depends on a number of factors, including the quality of the document image, how accurately the OCR picks up all of the data, and a great deal of programming logic. We are currently working on replacing some of the hard-coded data extraction logic with a data extraction model that will be more tolerant of slip changes over time and is optimized for DoxCycle’s OCR.
For the OCR:
We replaced the OCR supplier in August of last year (2024) when our contract expired because we could not justify the price increase the previous supplier was asking for. In testing, the accuracy of the new OCR was as good as or better than the old one, but we acknowledge that any change carries a risk of introducing new bugs. There is still some fine-tuning we need to do related to the OCR.
We have had some preliminary discussions about moving to an AI-based LLM, but no decisions have been made. We have a lot of ground to cover to understand the upfront and ongoing costs, the effort involved, and the level of customer interest.
Yes, assuming there are no big surprises that divert resources, we are planning to release DoxCycle updates to improve both the document classification and data extraction accuracy in time for the coming tax season.
Thanks! I too look forward to it. I use DoxCycle with almost all my files, mainly because not all slips are on CRA when returns are filed and also to have a link between data and the return, which saves a step when looking for something.
I discussed it with other firm owners. I am pretty sure a lot of us would be willing to pay a premium for perfect or near-perfect OCR. It would save so much labor time (and labor is hard to find during tax season).
Right now, American companies like Soraban and Stansford tax charge a per-return fee for organized slip collection and data extraction.
We can’t rely on CRA for timely slips, and the AFR really just serves as verification.
A great OCR would be the best step in automating tax return preparation and would be a great competitive advantage for TaxCycle.