PROJECT GOAL
The goal of the project is to create a bilingual (Polish & English) general-domain termbase with a minimum of 100,000 entries, which will be uploaded into the MemSource CAT software and used during translation. The database needs to contain two sets of columns: a) Polish term, b) English term. No extra characters or operators are allowed in the cells - just the actual terms. The main difficulty is finding and processing the input data.
INPUTS
Possible ready-made databases, such as:
- online dictionaries
- dictionaries on a USB stick or CD
OUTPUT FILE STRUCTURE
The basic structure of the output file needs to be as follows:
column A: Polish term
column B: English term
columns C-X: second and subsequent English terms (if available)
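The column layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name write_termbase and the sample entries are hypothetical, and CSV is assumed as the interchange format because the brief does not say which file format the CAT software import expects.

```python
import csv
import io

def write_termbase(entries, out):
    """Write one row per Polish term: column A is the Polish term,
    column B the first English term, columns C onward any further
    English terms."""
    writer = csv.writer(out, lineterminator="\n")
    for polish, english_terms in entries:
        writer.writerow([polish, *english_terms])

# Hypothetical sample data, not taken from any real dictionary.
sample_entries = [
    ("dom", ["house", "home"]),
    ("kot", ["cat"]),
]

buffer = io.StringIO()
write_termbase(sample_entries, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Rows with fewer English terms simply have fewer columns, so no placeholder values are needed.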
EXPECTED PROPOSAL
1. How you would find the right, high-quality data. Show that you have a concrete idea and the technical skill to obtain that data in the required quantity and quality.
2. Technology you would use.
3. Describe the process you would follow.
4. Describe expected outcomes.
IDEAL CANDIDATE
1. Self-reliant self-starter
2. Highly experienced in data analytics and/or data science
3. Great mathematical problem solver capable of building complex algorithms
ACCEPTANCE CRITERIA
1. Every row must contain at least one Polish term and at least one corresponding English term.
2. There is a minimum of 100,000 VALID entries (a Polish term together with its English term counts as 1 entry).
3. The source of the data is revealed to me so I can do a quality check and approve it.
4. Polish and English terms are clearly separated by an operator or placed in different columns.
5. Subsequent English terms are separated by a comma “,” OR a semicolon “;” OR are placed in different columns of the same row.
6. Abbreviations like “mat.” or “chem.” denote subject matter areas and are redundant - they should be excluded from the output file.
7. No entries shorter than 3 letters (that is, no terms 1 or 2 letters long).
8. No blank rows between rows.
9. Spot-check your work - I will spot-check 1,000 random entries to ensure proper quality and structure.
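Several of the criteria above (6, 7, 8 and 9) lend themselves to an automated pre-check before delivery. The following is a minimal sketch, not a prescribed method. All names (SUBJECT_LABELS, clean_term, is_valid_term, clean_rows, spot_check) are hypothetical, the label list contains only the two abbreviations named in the brief and would need extending for real data, and "shorter than 3 letters" is interpreted here as "fewer than 3 alphabetic characters", which is an assumption.

```python
import random
import re

# Assumption: only "mat." and "chem." are named in the brief;
# a real label list would be longer.
SUBJECT_LABELS = ("mat", "chem")
LABEL_RE = re.compile(
    r"(?<!\w)(?:%s)\.(?!\w)" % "|".join(map(re.escape, SUBJECT_LABELS))
)

def clean_term(term):
    """Strip subject-area labels and collapse whitespace."""
    without_labels = LABEL_RE.sub(" ", term)
    return re.sub(r"\s+", " ", without_labels).strip()

def is_valid_term(term):
    """Assumed reading of criterion 7: at least 3 alphabetic characters."""
    return sum(ch.isalpha() for ch in term) >= 3

def clean_rows(rows):
    """Yield cleaned (polish, [english, ...]) rows, dropping blank rows
    and rows left without a valid Polish or English term."""
    for polish, english_terms in rows:
        polish = clean_term(polish)
        english = [t for t in map(clean_term, english_terms) if is_valid_term(t)]
        if is_valid_term(polish) and english:
            yield polish, english

def spot_check(rows, k=1000, seed=0):
    """Draw up to k random rows for manual review."""
    return random.Random(seed).sample(rows, min(k, len(rows)))

# Hypothetical raw rows, not taken from any real dictionary.
raw = [
    ("chem. kwas", ["acid"]),
    ("", []),
    ("na", ["on"]),
    ("dom", ["house", "mat. home"]),
]
cleaned = list(clean_rows(raw))
print(len(cleaned), "valid rows")
```

The sketch counts rows, not Polish-English pairs; whether a row with several English terms counts as one entry or several toward the 100,000 minimum would need to be agreed beforehand.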