NLP - YAP Alternative Evaluation
Alternative Evaluation for YAP¶
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.set_context('talk')
sns.set_style("white")
sns.set_palette('Set2', 10)
%matplotlib inline
In [2]:
gold_file = "dev.hebtb.lgold.conll"
joint_file = "joint.arc.zeager.i50.dev.hebtb.uninf.conll"
goldpipe_file = "pipeline.zeager.i32.dev.hebtb.lgold.conll"
pipe_file = "pipeline.zeager.i32.dev.hebtb.uninf.i28.conll"
def make_conll_df(path):
    # CoNLL files are tab delimited with no quoting
    # quoting=3 is csv.QUOTE_NONE
    df = (pd.read_csv(path, sep='\t', header=None, quoting=3,
                      names=['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC'])
          # add sentence labels: a new sentence starts whenever ID resets to 1
          .assign(sent=lambda x: (x.ID == 1).cumsum())
          # replace bad root dependency tags
          .replace({'DEPREL': {'prd': 'ROOT'}})
          )
    # attach each token's head FORM and UPOS by joining on (sent, HEAD)
    df = df.merge(df[['ID', 'FORM', 'sent', 'UPOS']]
                    .rename(index=str, columns={'FORM': 'head_form', 'UPOS': 'head_upos'})
                    .set_index(['sent', 'ID']),
                  left_on=['sent', 'HEAD'], right_index=True, how='left')
    return df
gold, joint, goldpipe, pipe = map(make_conll_df, [gold_file, joint_file, goldpipe_file, pipe_file])
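To make the expected input concrete, here is a minimal, hypothetical fragment in the same 10-column, tab-separated CoNLL layout (the forms and tags below are invented placeholders, not real treebank rows):

import io
sample = ("1\tX1\tX1\tNN\tNN\t_\t2\tsubj\t_\t_\n"
          "2\tX2\tX2\tVB\tVB\t_\t0\tROOT\t_\t_\n")
# read_csv also accepts file-like objects, so the same helper parses the snippet
make_conll_df(io.StringIO(sample))[['sent', 'ID', 'FORM', 'HEAD', 'DEPREL', 'head_form', 'head_upos']]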
In [3]:
gold.head(30)
Out[3]:
In [4]:
joint.head()
Out[4]:
In [5]:
goldpipe.head()
Out[5]:
In [6]:
pipe.head()
Out[6]:
In [7]:
gold.groupby('sent').size().describe()
Out[7]:
Check the mean sentence length in each file:
In [8]:
[df.groupby('sent').size().mean() for df in (gold, joint, goldpipe, pipe)]
Out[8]:
Evaluate¶
Handwavy pseudo-algorithm:
1. points = 0
2. For each sentence:
    1. g <- set(gold[FORM, UPOS, DEPREL, head_form])
    2. t <- set(test[FORM, UPOS, DEPREL, head_form])
    3. points += len(g.intersection(t)) / avg(len(g), len(t))
Or in words: for each test sentence, count the correct tokens via set intersection with the corresponding gold sentence, then normalize by sentence length. The normalization used here is the average of the gold and test sentence lengths. A toy sketch of this scoring is shown below; the actual implementation follows it.
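As a quick illustration with plain Python sets (the tokens below are made up and are not from the treebank):

# two 3-token sentences; one row differs only in DEPREL
gold_rows = {('the', 'DT', 'det', 'cat'),
             ('cat', 'NN', 'subj', 'sat'),
             ('sat', 'VB', 'ROOT', None)}
test_rows = {('the', 'DT', 'det', 'cat'),
             ('cat', 'NN', 'obj', 'sat'),   # wrong DEPREL, so this row falls out of the intersection
             ('sat', 'VB', 'ROOT', None)}
norm = (len(gold_rows) + len(test_rows)) / 2
print(len(gold_rows & test_rows) / norm)   # 2 / 3 ≈ 0.67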
In [9]:
EVAL_COLS = ['FORM', 'UPOS', 'DEPREL', 'head_form']
def score(t, g=gold, columns=EVAL_COLS):
    # sentence id of this test group (constant within a groupby('sent') group)
    sent = t['sent'].iloc[0]
    # get the matching gold sentence
    g = g[g.sent == sent]
    # normalize by the average of the test and gold sentence lengths
    norm = (t.shape[0] + g.shape[0]) / 2
    # use pandas index set logic to get the intersection of evaluation tuples
    g = g.set_index(columns)
    t = t.set_index(columns)
    return len(g.index.intersection(t.index)) / norm
In [10]:
jnt_scores = joint.groupby('sent').apply(score)
jnt_scores.describe()
Out[10]:
In [11]:
gldp_scores = goldpipe.groupby('sent').apply(score)
gldp_scores.describe()
Out[11]:
In [12]:
pipe_scores = pipe.groupby('sent').apply(score)
pipe_scores.describe()
Out[12]:
Error analysis¶
In [13]:
jnt_scores[jnt_scores<0.2]
Out[13]:
In [14]:
joint[joint.sent==155]
Out[14]:
In [15]:
gold[gold.sent==155]
Out[15]:
In [16]:
joint[joint.sent==312]
Out[16]:
In [17]:
gold[gold.sent==312]
Out[17]:
Sentence length¶
In [18]:
joint_sent_len = joint.groupby('sent').size()
ax = sns.regplot(x=joint_sent_len, y=jnt_scores)
ax.set_xlabel('Joint Sentence Length')
ax.set_ylabel('Score')
Out[18]:
In [19]:
gldp_sent_len = goldpipe.groupby('sent').size()
ax = sns.regplot(x=gldp_sent_len, y=gldp_scores)
ax.set_xlabel('Goldpipe Sentence Length')
ax.set_ylabel('Score')
Out[19]:
In [20]:
pipe_sent_len = pipe.groupby('sent').size()
ax = sns.regplot(x=pipe_sent_len, y=pipe_scores)
ax.set_xlabel('Pipe Sentence Length')
ax.set_ylabel('Score')
Out[20]:
Sentence length diff¶
In [21]:
gold_sent_len = gold.groupby('sent').size()
ax = sns.boxplot(x=gold_sent_len - pipe_sent_len, y=pipe_scores)
ax.set_xlabel('Gold Length - Pipe Length')
ax.set_ylabel('Score')
ax.set_title('Sentence Length Diff')
Out[21]:
In [22]:
ax = sns.boxplot(x=gold_sent_len - joint_sent_len, y=jnt_scores)
ax.set_xlabel('Gold Length - Joint Length')
ax.set_ylabel('Score')
ax.set_title('Sentence Length Diff')
Out[22]:
Punctuation¶
Correlate punctuation and score¶
In [23]:
# punctuation POS tags in the Hebrew treebank start with 'yy'
joint['puncts'] = joint.UPOS.str.startswith('yy')
joint_pnct_ratio = joint.groupby('sent').puncts.sum() / joint.groupby('sent').size()
g = (sns.jointplot(x=joint_pnct_ratio, y=jnt_scores, kind="reg", size=8, ratio=6)
     .set_axis_labels("Punct/Sentence Length Ratio", "Score"))
There is a negative correlation here. Next, let's remove the punctuation and re-evaluate the scores.
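One quick way to put a number on that trend (assuming SciPy is available in this environment; this check is an addition, not part of the original analysis):

from scipy.stats import pearsonr
# both Series come from groupby('sent') on the same sentences, so they align positionally
r, p = pearsonr(joint_pnct_ratio, jnt_scores)
print(r, p)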
Remove punctuation chars and re-evaluate¶
In [24]:
def depunct(df):
    # fill missing head fields with a placeholder so the string checks below don't hit NaNs
    new_df = df.fillna({'head_upos': '___', 'head_form': '___'}).copy()
    # drop the punctuation tokens themselves
    new_df = new_df[~new_df.UPOS.str.startswith('yy')]
    # mask heads that are punctuation tokens with the same placeholder
    new_df.loc[new_df['head_upos'].str.startswith('yy', na=False), 'head_form'] = '___'
    return new_df
pipe_nop = depunct(pipe)
gold_nop = depunct(gold)
gldp_nop = depunct(goldpipe)
jnt_nop = depunct(joint)
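An optional sanity check (an addition here, not in the original notebook) that no punctuation rows survive depunct:

assert not jnt_nop.UPOS.str.startswith('yy').any()
assert not pipe_nop.UPOS.str.startswith('yy').any()
assert not gldp_nop.UPOS.str.startswith('yy').any()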
In [25]:
gold_nop[gold_nop.sent==155]
Out[25]:
In [26]:
gold[gold.sent==155]
Out[26]:
In [27]:
gold_nop[gold_nop.sent==312]
Out[27]:
In [28]:
gold[gold.sent==312]
Out[28]:
In [29]:
jnt_nop_score = jnt_nop.groupby('sent').apply(score, g=gold_nop)
jnt_nop_score.describe()
Out[29]:
In [30]:
gldp_nop_score = gldp_nop.groupby('sent').apply(score, g=gold_nop)
gldp_nop_score.describe()
Out[30]:
In [31]:
pipe_nop_score = pipe_nop.groupby('sent').apply(score, g=gold_nop)
pipe_nop_score.describe()
Out[31]:
High-scoring sentences with segmentation errors¶
In [36]:
pipe_nop_score[(pipe_nop_score>0.9) & (pipe_nop.groupby('sent').size()!=gold_nop.groupby('sent').size())]
Out[36]:
In [37]:
pd.concat([pipe_nop.loc[pipe_nop.sent == 379, EVAL_COLS].reset_index(drop=True),
           gold_nop.loc[gold_nop.sent == 379, EVAL_COLS].reset_index(drop=True)],
          axis=1, keys=['pipe', 'gold'])
Out[37]:
Nice!
There is only one segmentation error, at position 3, and the score reflects exactly that.
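Back-of-the-envelope arithmetic for why a single segmentation error still leaves a high score (the lengths below are illustrative, not the actual counts for sentence 379):

gold_len, test_len, shared = 20, 21, 19   # one extra test row, one unmatched gold row
print(shared / ((gold_len + test_len) / 2))   # ≈ 0.93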