I need a web application that extracts text from PDF pages (magazine pages) into XML format.
* Extraction of text from one or more PDF pages
* The final result needs to be formatted HTML text
I also require editing capabilities. I do not want to extract just the plain text; I need the text to keep a certain format.
* The extracted text does NOT need to have the same visual format as the source PDF text. It is enough if just the text is extracted.
I need to retain similar formatting: body text should stay body text and headlines should be recognizable as headlines. It is enough to distinguish between 2 or 3 different font sizes (headline, paragraph, ...). The extracted text only needs to have one font.
BUT: The text needs to be formatted according to the PDF, meaning
- Text shall stay text
- Headlines shall stay headlines
This needs to be automatically recognized to a certain degree. I want to keep the required user interaction as low as possible.
There are tools that can analyze the font and the text size during extraction, and you need to use one of them. These could be tools such as:
[login to view URL]
[login to view URL]
I am open to other suggestions too, though.
I will then purchase the required license for the final application.
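
To illustrate the kind of automatic recognition meant above, here is a minimal sketch in Python. It assumes the extraction tool can deliver each text run together with its font size; the function names `classify_spans` and `to_html` and the input shape (a list of text and font size pairs) are hypothetical and not tied to any particular tool. The most common font size is treated as paragraph text, the largest bigger size as a headline and any other bigger size as a sub-headline.

```python
from collections import Counter
from html import escape


def classify_spans(spans):
    """Map (text, font_size) pairs to (role, text) pairs.

    Assumption: `spans` is what some PDF extraction tool returns;
    the real tool's output format may differ.
    """
    if not spans:
        return []
    # The body size is the (rounded) font size covering the most characters.
    weight = Counter()
    for text, size in spans:
        weight[round(size)] += len(text)
    body = weight.most_common(1)[0][0]
    # Every size larger than the body size is some kind of headline.
    larger = sorted({round(s) for _, s in spans if round(s) > body}, reverse=True)
    result = []
    for text, size in spans:
        rounded = round(size)
        if rounded <= body:
            role = "p"    # paragraph text
        elif rounded == larger[0]:
            role = "h1"   # largest size: headline
        else:
            role = "h2"   # anything in between: sub-headline
        result.append((role, text))
    return result


def to_html(classified):
    """Render (role, text) pairs as simple HTML, one element per line."""
    return "\n".join(f"<{role}>{escape(text)}</{role}>" for role, text in classified)


if __name__ == "__main__":
    sample = [
        ("Summer Issue", 28.0),
        ("Travel & Food", 16.0),
        ("This is the first paragraph of the article text.", 10.0),
        ("This is another paragraph in the same font size.", 10.2),
    ]
    print(to_html(classify_spans(sample)))
```

A heuristic like this will misjudge some pages (pull quotes, captions, drop caps), which is why the user still needs to be able to correct the result by hand.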
* The user shall be able to modify the extracted text, e.g. add blank lines to it or increase the font size of selected text, and save the changes again.
* The user shall be able to select an area of the extracted text and add a unique ID tag to it, so that this area can be accessed later through its ID.
* The images of a page also need to be extracted (reduced to a fixed max. size) and placed at the end of the extracted page.
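
Two of the points above can be sketched roughly as follows. This is only an illustration under assumptions: the selection is given as character offsets into the plain extracted text, the ID is either supplied by the user or generated, and the image step only computes the reduced dimensions (the actual resizing would be done by whatever image library is chosen). The names `tag_selection` and `fit_within` are hypothetical.

```python
import uuid
from html import escape


def tag_selection(text, start, end, tag_id=None):
    """Wrap text[start:end] in a <span> carrying a unique id.

    Returns (html, tag_id). Offsets refer to the plain text, not to HTML.
    """
    if not 0 <= start < end <= len(text):
        raise ValueError("selection out of range")
    # Generate an id when the user did not supply one.
    tag_id = tag_id or f"sel-{uuid.uuid4().hex[:8]}"
    html = (
        escape(text[:start])
        + f'<span id="{escape(tag_id)}">'
        + escape(text[start:end])
        + "</span>"
        + escape(text[end:])
    )
    return html, tag_id


def fit_within(width, height, max_side):
    """Scale (width, height) down so that neither side exceeds max_side.

    The aspect ratio is kept and images are never enlarged.
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    html, used_id = tag_selection("Hello world", 6, 11, "intro")
    print(html, used_id)
    print(fit_within(2000, 1000, 800))
```

The tagged area can later be found again in the saved HTML by looking up the element with that ID.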
Plus: very simple user management is required.
Server: I am not tied to a certain type of server (can be Apache or Windows).
An example can be provided to each bidding developer.